Solving Unicode Problems in Python 2.7

By Derek Dohler on March 24th, 2014

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xd1 in position 1: ordinal not in range(128) (Why is this so hard??)

One of the toughest things to get right in a Python program is Unicode handling. If you’re reading this, you’re probably in the middle of discovering this the hard way.

The main reasons Unicode handling is difficult in Python are that the existing terminology is confusing and that many potentially problematic cases are handled transparently. This spares many people from ever having to learn what’s really going on, until they suddenly run into a brick wall when they need to handle data containing characters outside the ASCII character set.

If you’ve just run into the Python 2 Unicode brick wall, here are three steps you can take to start thinking about strings and Unicode the right way:

1. str is for bytes, NOT strings

The first step toward solving your Unicode problem is to stop thinking of <type 'str'> as storing strings (that is, sequences of human-readable characters, a.k.a. text). Instead, start thinking of <type 'str'> as a container for bytes. Objects of <type 'str'> are in fact perfectly happy to store arbitrary byte sequences.

To get yourself started, take a look at the string literals in your code. Every time you see 'abc', "abc", or """abc""", say to yourself “That’s a sequence of 3 bytes corresponding to the ASCII codes for the letters a, b, and c” (technically, the bytes are in your source file’s encoding, commonly UTF-8, but ASCII and UTF-8 are the same for unaccented Latin letters).
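To make the “literals are bytes” idea concrete, here is a minimal sketch that should run on both Python 2.7 and Python 3. The variable names are only for illustration, and the b prefix is used so the byte-string nature of the literal is explicit on both versions (in Python 2.7, b'abc' and 'abc' are the same thing).

```python
from __future__ import print_function

literal = b'abc'  # in Python 2.7 this is identical to writing 'abc'

# bytearray exposes the individual byte values on both Python versions.
byte_values = list(bytearray(literal))
print(byte_values)  # [97, 98, 99]

# Non-ASCII text shows the difference between characters and bytes.
name = u'Jos\xe9'                  # 4 characters of text (\xe9 is e-acute)
name_bytes = name.encode('utf-8')  # 5 bytes once serialized as UTF-8
print(len(name), len(name_bytes))  # 4 5
```

The accented letter is one character of text but two bytes of UTF-8, which is exactly the kind of mismatch that stays hidden while your data is pure ASCII.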

2. unicode is for strings

The second step toward solving your problem is to start using <type 'unicode'> as your go-to container for strings.

For starters, that means using the “u” prefix for literals, which creates objects of <type 'unicode'>, rather than plain quotes, which create objects of <type 'str'> (don’t bother with docstrings; you’ll rarely have to manipulate them yourself, and manipulation is where problems usually happen). There are some other good practices, which I’ll discuss below.

3. UTF-8, UTF-16, and UTF-32 are serialization formats — NOT Unicode

UTF-8 is an encoding, just like ASCII (more on encodings below), which is represented with bytes. The difference is that the UTF-8 encoding can represent every Unicode character, while the ASCII encoding can’t. But they’re both still bytes. By contrast, an object of <type 'unicode'> is just that: a Unicode object. It isn’t encoded or represented by any particular sequence of bytes. You can think of Unicode objects as storing abstract, Platonic representations of text, while ASCII, UTF-8, UTF-16, etc. are different ways of serializing (encoding) your text.
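As a rough illustration of “one text, many serializations”, the sketch below encodes the same Unicode object three ways and decodes each back. It should run on Python 2.7 and Python 3. The -le codec variants are my choice here so that no byte order mark is prepended and the lengths are easy to reason about.

```python
from __future__ import print_function

text = u'Jos\xe9'  # one abstract piece of text (\xe9 is e-acute)

# Three different serializations of the same text.
as_utf8 = text.encode('utf-8')
as_utf16 = text.encode('utf-16-le')
as_utf32 = text.encode('utf-32-le')

print(len(as_utf8), len(as_utf16), len(as_utf32))  # 5 8 16

# The byte sequences differ, but each decodes back to the same text.
assert as_utf8 != as_utf16
assert as_utf8.decode('utf-8') == text
assert as_utf16.decode('utf-16-le') == text
assert as_utf32.decode('utf-32-le') == text

# ASCII cannot represent every character, so encoding this text as ASCII fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print('ASCII can encode it:', ascii_ok)  # ASCII can encode it: False
```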

Okay, but why can’t I use str for strings? (Detailed problem description)

The reason for going through the mind-shift above is that since <type 'str'> stores bytes, it has an implicit encoding, and encodings (and/or attempts to decode with the wrong encoding) cause the majority of Unicode problems in Python 2.

What do I mean by encoding? It’s the scheme that maps the characters we read to a sequence of bits. For example, the “abc” string from above is actually being stored like this: 01100001 01100010 01100011.

But there are other ways to store “abc”. If you store it in UTF-8, it looks exactly like the ASCII version because UTF-8 and ASCII are the same for unaccented Latin letters. But if you store “abc” in UTF-16 (big-endian, without a byte order mark), you get 0000000001100001 0000000001100010 0000000001100011.
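If you want to check those bit patterns yourself, here is a small sketch that should run on Python 2.7 and Python 3. The bits() helper is a hypothetical name made up for this example, and the utf-16-be codec is used as an assumption so the output has no byte order mark and the high byte comes first.

```python
from __future__ import print_function

def bits(data):
    """Return the bytes in `data` as space-separated 8-bit groups."""
    return ' '.join(format(b, '08b') for b in bytearray(data))

text = u'abc'
ascii_bits = bits(text.encode('ascii'))
utf8_bits = bits(text.encode('utf-8'))
utf16_bits = bits(text.encode('utf-16-be'))

print(ascii_bits)  # 01100001 01100010 01100011
print(utf8_bits)   # 01100001 01100010 01100011
print(utf16_bits)  # 00000000 01100001 00000000 01100010 00000000 01100011
```

Each UTF-16 code unit shows up as two 8-bit groups here, so every letter is preceded by a zero byte.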

Encodings are important because you have to use them whenever text travels outside the bounds of your program: if you want to write a string to a file, or send it over a network, or store it in a database, it needs to have an encoding. And if you send out the wrong encoding (that is, a byte sequence that your receiver doesn’t expect), you’ll get Unicode errors.

The problem with <type 'str'>, and the main reason why Unicode in Python 2.7 is confusing, is that the encoding of a given instance of <type 'str'> is implicit. This means that the only way to discover the encoding of a given instance of <type 'str'> is to try to decode the byte sequence and see if it explodes. Unfortunately, there are lots of places where byte sequences get invisibly decoded, which can cause confusion and problems. Here are some example lines to demonstrate:

# Set up the variables we'll use
>>> uni_greeting = u'Hi, my name is %s.'
>>> utf8_greeting = uni_greeting.encode('utf-8')

>>> uni_name = u'José'  # Note the accented e.
>>> utf8_name = uni_name.encode('utf-8')

# Plugging a Unicode into another Unicode works fine
>>> uni_greeting % uni_name
u'Hi, my name is Jos\xe9.'

# Plugging UTF-8 into another UTF-8 string works too
>>> utf8_greeting % utf8_name
'Hi, my name is Jos\xc3\xa9.'

# You can plug Unicode into a UTF-8 byte sequence...
>>> utf8_greeting % uni_name  # UTF-8 invisibly decoded into Unicode; note the return type
u'Hi, my name is Jos\xe9.'

# But plugging a UTF-8 string into a Unicode doesn't work so well...
>>> uni_greeting % utf8_name  # Invisible decoding doesn't work in this direction.
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

# Unless you plug in ASCII-compatible data, that is.
>>> uni_greeting % u'Bob'.encode('utf-8')
u'Hi, my name is Bob.'

# And you can forget about string interpolation completely if you're using UTF-16.
>>> uni_greeting.encode('utf-16') % uni_name
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: unsupported format character '' (0x0) at index 33

# Well, you can interpolate utf-16 into utf-8 because these are just byte sequences
>>> utf8_greeting % uni_name.encode('utf-16')  # But this is a useless mess
'Hi, my name is \xff\xfeJ\x00o\x00s\x00\xe9\x00.'

The examples above should show you why using <type 'str'> is problematic; invisible decoding coupled with the implicit encodings for <type 'str'> can hide serious problems. Everything will work just fine as long as your code handles strictly ASCII data. Then, one day, a hapless “é” will blunder into your input. Code which implicitly assumes (and invisibly decodes) ASCII-encoded input will suddenly have to contend with UTF-8-encoded data, and the whole thing can blow up; even your exception handlers may start throwing UnicodeDecodeErrors.

Solution: The Unicode ‘airlock’

The best way to attack the problem, as with many things in Python, is to be explicit. That means that every string that your code handles needs to be clearly treated as either Unicode or a byte sequence.

The most systematic way to accomplish this is to make your code into a Unicode-only clean room. That is, your code should only use Unicode objects internally; you may even want to put checks for <type 'unicode'> in key places to keep yourself honest.
Then, put ‘airlocks’ at the entry points to your code which will ensure that any byte sequence attempting to enter your code is properly clothed in a protective Unicode bunny suit before being allowed inside.

For example:

import codecs

with open('file.txt') as f:  # BAD--gives you bytes
    ...
with codecs.open('file.txt', encoding='utf-8') as f:  # GOOD--gives you Unicode
    ...

This might sound slow and cumbersome, but it’s actually pretty easy; most well-known Python libraries follow this practice already, so you usually only need to worry about input coming from files, network requests, etc.
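One way to build such an airlock is a pair of small conversion helpers at the boundary. This is only a sketch under my own assumptions: to_unicode and to_bytes are hypothetical names, UTF-8 is assumed as the default encoding, and the type lookups are written so the same code runs on Python 2.7 and Python 3.

```python
text_type = type(u'')   # unicode on Python 2, str on Python 3
bytes_type = type(b'')  # str on Python 2, bytes on Python 3

def to_unicode(value, encoding='utf-8'):
    """Airlock inbound: decode bytes to text, pass text through unchanged."""
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    raise TypeError('expected bytes or text, got %r' % type(value))

def to_bytes(value, encoding='utf-8'):
    """Airlock outbound: encode text to bytes, pass bytes through unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    if isinstance(value, bytes_type):
        return value
    raise TypeError('expected bytes or text, got %r' % type(value))

# Bytes arrive from outside, get decoded once, and stay Unicode inside.
incoming = b'Jos\xc3\xa9'
name = to_unicode(incoming)
greeting = u'Hi, my name is %s.' % name

# Encode only at the moment the data leaves your code.
outgoing = to_bytes(greeting)
```

The explicit TypeError is a design choice: anything that is neither bytes nor text is stopped at the airlock instead of being silently coerced.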

Airlock Construction Kit (Useful Unicode tools)

Nearly every Unicode problem can be solved by the proper application of these tools; they will help you build an airlock to keep the inside of your code nice and clean:

  • encode(): Gets you from Unicode -> bytes
  • decode(): Gets you from bytes -> Unicode
  • codecs.open(encoding='utf-8'): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
  • u'': Makes your string literals into Unicode objects rather than byte sequences.

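Warning: Don’t use encode() on bytes or decode() on Unicode objects. In Python 2, both calls first perform an implicit conversion using the ASCII codec, so they fail with the same confusing errors as soon as non-ASCII data appears.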
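Here is a short sketch of the tools working together on a file round trip. It swaps in io.open for codecs.open, since io.open takes the same encoding argument, exists in Python 2.6+, and is the built-in open in Python 3, so the example runs on both. The scratch file path is hypothetical and created in a temporary directory.

```python
import io
import os
import tempfile

name = u'Jos\xe9'  # Unicode inside the program

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'name.txt')

# Text mode with an explicit encoding: encodes on write...
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(name)

# ...and decodes on read, so you get Unicode back.
with io.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

# Binary mode shows what is actually stored on disk: UTF-8 bytes.
with io.open(path, 'rb') as f:
    raw = f.read()

assert read_back == name
assert raw == name.encode('utf-8')
assert raw.decode('utf-8') == name
```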

Troubleshooting

The key to troubleshooting Unicode errors in Python is to know what types you have. Then, try these steps:

  1. If some variables are byte sequences instead of Unicode objects, convert them to Unicode objects with decode() / u'' before handling them.

    >>> uni_greeting % utf8_name
    Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
    # Solution:
    >>> uni_greeting % utf8_name.decode('utf-8')
    u'Hi, my name is Jos\xe9.'
  2. If all variables are byte sequences, there is probably an encoding mismatch; convert everything to Unicode objects with decode() / u'' and try again.

  3. If all variables are already Unicode, then part of your code may not know how to deal with Unicode objects; either fix the code, or encode to a byte sequence before sending the data (and make sure to decode any return values back to Unicode):

    >>> with open('test.out', 'wb') as f:
    ...     f.write(uni_name)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
    # Solution:
    >>> with open('test.out', 'wb') as f:
    ...     f.write(uni_name.encode('utf-8'))
    # Better Solution:
    >>> import codecs
    >>> with codecs.open('test.out', 'w', encoding='utf-8') as f:
    ...     f.write(uni_name)
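Since every step above starts with knowing what types you have, a tiny inspection helper can save some guessing. This is a sketch only: describe() is a hypothetical name, and it is written to run on both Python 2.7 and Python 3.

```python
def describe(value):
    """Return a short label saying whether `value` is text or bytes."""
    if isinstance(value, type(u'')):   # unicode on Python 2, str on Python 3
        return 'text (%d characters)' % len(value)
    if isinstance(value, type(b'')):   # str on Python 2, bytes on Python 3
        return 'bytes (%d bytes)' % len(value)
    return 'other (%s)' % type(value).__name__

print(describe(u'Jos\xe9'))      # text (4 characters)
print(describe(b'Jos\xc3\xa9'))  # bytes (5 bytes)
print(describe(42))              # other (int)
```

Dropping a call like this next to the line that raises the error usually tells you which of the three cases above you are in.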

Other points

Python 3 solves this problem by becoming more explicit: string literals are now Unicode by default, while byte sequences are stored in a separate type called ‘bytes’.

For a much more thorough look at these issues, take a look at http://docs.python.org/2/howto/unicode.html .

Good luck!

