JSON Serialization and the Unicode Escape Mystery


Data now routinely crosses borders, languages, and scripts, so applications need to handle international characters correctly. This brings us to a small mystery that many developers encounter when working with Python’s json module: the case of the unexpected Unicode escape sequences.

The Puzzle

You’ve been there. You’re serializing a dictionary containing some non-English text into a JSON string in Python:

import json

data = {
    "greeting": "你好"
}


json_string = json.dumps(data)
print(json_string) # Outputs: {"greeting": "\u4f60\u597d"}

Instead of the actual Chinese characters “你好”, you find yourself staring at the Unicode escape sequences “\u4f60\u597d”. Why did this happen? Is it a bug?

Delving into the Behavior

It is not a bug. The json.dumps() function in Python has a parameter, ensure_ascii, which defaults to True. With that setting, every non-ASCII character is written as a \uXXXX escape sequence during serialization, so the resulting JSON string is pure ASCII. The escaped form is still valid JSON, and any conforming parser (including json.loads()) decodes it back to the original characters.

The benefit of this default is portability: an ASCII-only string survives systems that mishandle other encodings. For instance, older systems or certain network protocols might not handle non-ASCII characters well.
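
To make the default concrete, here is a minimal sketch (the "mood" key and the emoji value are illustrative additions, not part of the example above). It checks that the default output really is pure ASCII, and shows that a character outside the Basic Multilingual Plane is written as a pair of \u escapes (a surrogate pair) rather than a single one:

```python
import json

# Illustrative data: one CJK string and one emoji (U+1F600, outside the BMP).
sample = {"greeting": "你好", "mood": "😀"}

# Default behavior: ensure_ascii=True
escaped = json.dumps(sample)
print(escaped)

# The serialized string contains only ASCII characters.
print(escaped.isascii())

# The emoji is represented by two escapes, a UTF-16 surrogate pair.
print("\\ud83d\\ude00" in escaped)
```

str.isascii() requires Python 3.7 or later.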

The Solution: ensure_ascii=False

If you want your JSON string to retain the actual non-ASCII characters, you can set the ensure_ascii parameter to False:

json_string = json.dumps(data, ensure_ascii=False)
print(json_string) # Outputs: {"greeting": "你好"}

With ensure_ascii=False, the output JSON contains the actual characters instead of escape sequences. The data is identical either way; only the textual representation changes. The unescaped form is easier for people to read, especially if you’re working with a lot of non-English content.
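
As a quick sanity check, the following sketch serializes the same dictionary both ways and confirms that the two strings parse back to equal data. It also prints the UTF-8 byte length of each form, since the escaped version spends six ASCII bytes on every escaped character:

```python
import json

data = {"greeting": "你好"}

escaped = json.dumps(data)                       # ensure_ascii=True (default)
readable = json.dumps(data, ensure_ascii=False)  # keep the real characters

# Both strings are valid JSON for the same object.
same = json.loads(escaped) == json.loads(readable) == data
print(same)

# Size of each form once encoded as UTF-8 bytes.
escaped_size = len(escaped.encode("utf-8"))
readable_size = len(readable.encode("utf-8"))
print(escaped_size, readable_size)
```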

Precautions

Although setting ensure_ascii to False makes the output more readable, you need to know who consumes the JSON. Make sure that every system or process that receives, transmits, or stores it can handle non-ASCII characters.

For example, if the JSON data is being sent to a web service, check that the service expects UTF-8 encoded data, and encode the string to UTF-8 bytes before sending it. If the data is being written to a file, open the file with an explicit encoding such as “utf-8” rather than relying on the platform’s default encoding, which may not be able to represent the characters.
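
A minimal sketch of both precautions, assuming a throwaway file named greeting.json in a temporary directory (the file name and location are hypothetical, chosen only for this illustration):

```python
import json
import os
import tempfile

data = {"greeting": "你好"}

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "greeting.json")

# Writing to a file: pass encoding="utf-8" explicitly so the result
# does not depend on the platform's default encoding.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

# Reading it back with the same encoding restores the original data.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded == data)

# Sending over the network: encode the string to UTF-8 bytes first.
body = json.dumps(data, ensure_ascii=False).encode("utf-8")
print(type(body).__name__, len(body))
```

If you cannot verify what the consumer supports, leaving ensure_ascii at its default remains the safe choice, since the escaped output is plain ASCII.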


Author: robot learner
Reprint policy: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.