Azavea Labs

Where software engineering meets GIS.

Solving Unicode Problems in Python 2.7

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xd1 in position 1: ordinal not in range(128) (Why is this so hard??)

One of the toughest things to get right in a Python program is Unicode handling. If you’re reading this, you’re probably in the middle of discovering this the hard way.

Unicode handling is difficult in Python mainly because the existing terminology is confusing, and because many potentially problematic cases are handled transparently. This prevents many people from ever having to learn what’s really going on, until they suddenly run into a brick wall when they want to handle data that contains characters outside the ASCII character set.

If you’ve just run into the Python 2 Unicode brick wall, here are three steps you can take to start thinking about strings and Unicode the right way:

1. str is for bytes, NOT strings

The first step toward solving your Unicode problem is to stop thinking of <type 'str'> as storing strings (that is, sequences of human-readable characters, a.k.a. text). Instead, start thinking of <type 'str'> as a container for bytes. Objects of <type 'str'> are in fact perfectly happy to store arbitrary byte sequences.

To get yourself started, take a look at the string literals in your code. Every time you see 'abc', "abc", or """abc""", say to yourself “That’s a sequence of 3 bytes corresponding to the ASCII codes for the letters a, b, and c” (technically, it’s UTF-8, but ASCII and UTF-8 are the same for Latin letters).
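You can check this byte-level view for yourself in a Python 2.7 interpreter; a quick illustrative session:

>>> s = 'abc'
>>> type(s)
<type 'str'>
>>> len(s)  # three bytes
3
>>> [ord(byte) for byte in s]  # the ASCII codes for a, b, and c
[97, 98, 99]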

2. unicode is for strings

The second step toward solving your problem is to start using <type 'unicode'> as your go-to container for strings.

For starters, that means using the “u” prefix for literals, which creates objects of <type 'unicode'>, rather than plain quotes, which create objects of <type 'str'> (don’t bother converting your docstrings; you’ll rarely have to manipulate them yourself, which is where problems usually happen). There are some other good practices which I’ll discuss below.
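A quick interpreter check shows the difference the prefix makes:

>>> type('abc')   # plain quotes: a byte sequence
<type 'str'>
>>> type(u'abc')  # "u" prefix: a Unicode object
<type 'unicode'>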

3. UTF-8, UTF-16, and UTF-32 are serialization formats — NOT Unicode

UTF-8 is an encoding, just like ASCII (more on encodings below), which is represented with bytes. The difference is that the UTF-8 encoding can represent every Unicode character, while the ASCII encoding can’t. But they’re both still bytes. By contrast, an object of <type 'unicode'> is just that — a Unicode object. It isn’t encoded or represented by any particular sequence of bytes. You can think of Unicode objects as storing abstract, Platonic representations of text, while ASCII, UTF-8, UTF-16, etc. are different ways of serializing (encoding) your text.
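To make the distinction concrete, here is one character serialized a few different ways in a Python 2.7 session (note that the UTF-16 output includes a byte-order mark):

>>> accented_e = u'\xe9'         # é as an abstract Unicode object
>>> accented_e.encode('utf-8')   # one serialization: two bytes
'\xc3\xa9'
>>> accented_e.encode('utf-16')  # another: a BOM plus two bytes
'\xff\xfe\xe9\x00'
>>> accented_e.encode('ascii')   # ASCII can't represent it at all
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)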

Okay, but why can’t I use str for strings? (Detailed problem description)

The reason for going through the mind-shift above is that since <type 'str'> stores bytes, it has an implicit encoding, and encodings (and/or attempts to decode the wrong encoding) cause the majority of Unicode problems in Python 2.

What do I mean by encoding? It’s the sequence of bits used to represent the characters that we read. That is, the “abc” string from above is actually being stored like this: 01100001 01100010 01100011.

But there are other ways to store “abc” — if you store it in UTF-8, it looks exactly like the ASCII version because UTF-8 and ASCII are the same for Latin letters. But if you store “abc” in UTF-16, you get 0000000001100001 0000000001100010 0000000001100011.

Encodings are important because you have to use them whenever text travels outside the bounds of your program: if you want to write a string to a file, or send it over a network, or store it in a database, it needs to have an encoding. And if you send out the wrong encoding (that is, a byte sequence that your receiver doesn’t expect), you’ll get Unicode errors.

The problem with <type 'str'>, and the main reason why Unicode in Python 2.7 is confusing, is that the encoding of a given instance of <type 'str'> is implicit. This means that the only way to discover the encoding of a given instance of <type 'str'> is to try to decode the byte sequence and see if it explodes. Unfortunately, there are lots of places where byte sequences get invisibly decoded, which can cause confusion and problems. Here are some example lines to demonstrate:

# Set up the variables we'll use
>>> uni_greeting = u'Hi, my name is %s.'
>>> utf8_greeting = uni_greeting.encode('utf-8')

>>> uni_name = u'José'  # Note the accented e.
>>> utf8_name = uni_name.encode('utf-8')

# Plugging a Unicode into another Unicode works fine
>>> uni_greeting % uni_name
u'Hi, my name is Jos\xe9.'

# Plugging UTF-8 into another UTF-8 string works too
>>> utf8_greeting % utf8_name
'Hi, my name is Jos\xc3\xa9.'

# You can plug Unicode into a UTF-8 byte sequence...
>>> utf8_greeting % uni_name  # UTF-8 invisibly decoded into Unicode; note the return type
u'Hi, my name is Jos\xe9.'

# But plugging a UTF-8 string into a Unicode doesn't work so well...
>>> uni_greeting % utf8_name  # Invisible decoding doesn't work in this direction.
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

# Unless you plug in ASCII-compatible data, that is.
>>> uni_greeting % u'Bob'.encode('utf-8')
u'Hi, my name is Bob.'

# And you can forget about string interpolation completely if you're using UTF-16.
>>> uni_greeting.encode('utf-16') % uni_name
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: unsupported format character '' (0x0) at index 33

# Well, you can interpolate utf-16 into utf-8 because these are just byte sequences
>>> utf8_greeting % uni_name.encode('utf-16')  # But this is a useless mess
'Hi, my name is \xff\xfeJ\x00o\x00s\x00\xe9\x00.'

The examples above should show you why using <type 'str'> is problematic; invisible decoding coupled with the implicit encodings for <type 'str'> can hide serious problems. Everything will work just fine as long as your code handles strictly ASCII data. Then, one day, a hapless “é” will blunder into your input. Code which implicitly assumes (and invisibly decodes) ASCII-encoded input will suddenly have to contend with UTF-8-encoded data, and the whole thing can blow up; even your exception handlers may start throwing UnicodeDecodeErrors.

Solution: The Unicode ‘airlock’

The best way to attack the problem, as with many things in Python, is to be explicit. That means that every string that your code handles needs to be clearly treated as either Unicode or a byte sequence.

The most systematic way to accomplish this is to make your code into a Unicode-only clean room. That is, your code should only use Unicode objects internally; you may even want to put checks for <type 'unicode'> in key places to keep yourself honest.

Then, put ‘airlocks’ at the entry points to your code which will ensure that any byte sequence attempting to enter your code is properly clothed in a protective Unicode bunny suit before being allowed inside.

For example:

import codecs

with open('file.txt') as f:  # BAD -- gives you bytes
    ...

with codecs.open('file.txt', encoding='utf-8') as f:  # GOOD -- gives you Unicode
    ...

This might sound slow and cumbersome, but it’s actually pretty easy; most well-known Python libraries follow this practice already, so you usually only need to worry about input coming from files, network requests, etc.
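As a sketch of what an honesty check at a clean-room boundary might look like (the function and its name are just an illustration):

def greet(name):
    # Only Unicode objects are allowed inside the clean room.
    if not isinstance(name, unicode):
        raise TypeError('greet() expects unicode, got %r' % type(name))
    return u'Hi, my name is %s.' % name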

Airlock Construction Kit (Useful Unicode tools)

Nearly every Unicode problem can be solved by the proper application of these tools; they will help you build an airlock to keep the inside of your code nice and clean:

  • encode(): Gets you from Unicode -> bytes
  • decode(): Gets you from bytes -> Unicode
  • codecs.open(encoding='utf-8'): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
  • u'': Makes your string literals into Unicode objects rather than byte sequences.

Warning: Don’t use encode() on bytes or decode() on Unicode objects.
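The reason for this warning: in Python 2, calling encode() on a byte sequence first decodes those bytes as ASCII implicitly, so non-ASCII input blows up with a confusingly named error. Reusing utf8_name from the examples above:

>>> utf8_name.encode('utf-8')  # encode() on bytes triggers an implicit ASCII decode first
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)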

Troubleshooting

The key to troubleshooting Unicode errors in Python is to know what types you have. Then, try these steps:

  1. If some variables are byte sequences instead of Unicode objects, convert them to Unicode objects with decode() / u'' before handling them.

    >>> uni_greeting % utf8_name
    Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
    # Solution:
    >>> uni_greeting % utf8_name.decode('utf-8')
    u'Hi, my name is Jos\xe9.'
  2. If all variables are byte sequences, there is probably an encoding mismatch; convert everything to Unicode objects with decode() / u'' and try again.

  3. If all variables are already Unicode, then part of your code may not know how to deal with Unicode objects; either fix the code, or encode to a byte sequence before sending the data (and make sure to decode any return values back to Unicode):

    >>> with open('test.out', 'wb') as f:
    ...     f.write(uni_name)
    Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
    # Solution:
    >>> f.write(uni_name.encode('utf-8'))
    # Better Solution:
    >>> import codecs
    >>> with codecs.open('test.out', 'w', encoding='utf-8') as f:
    ...     f.write(uni_name)

Other points

Python 3 solves this problem by becoming more explicit: string literals are now Unicode by default, while byte sequences are stored in a new type called ‘bytes’.

For a much more thorough look at these issues, take a look at http://docs.python.org/2/howto/unicode.html.

Good luck!

Exporting Django Querysets to CSV

At Azavea, we have numerous client projects that must provide exportable data in CSV format. The reasons for this range from simple data exchange to complex export-modify-import workflows. In order to make this process easier for Django projects, we made a simple utility, django-queryset-csv, for exporting Django querysets directly to HTTP responses with CSVs attached.


So you have something like this:

installation:

pip install django-queryset-csv

models.py:
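(A minimal model for illustration; the field names here are ours, not prescribed by the library.)

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=50, verbose_name="Person's name")
    address = models.CharField(max_length=255)
    info = models.TextField(verbose_name='Info on Person')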

views.py:
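(A matching view sketch; render_to_csv_response takes a queryset and returns an HTTP response with the CSV attached.)

from djqscsv import render_to_csv_response

from .models import Person

def csv_view(request):
    people = Person.objects.filter(name__startswith='a')
    return render_to_csv_response(people)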

Pain Points

Why bother? Can’t you write a for-loop to export a CSV in a dozen or so lines of code? It turns out there are a few pain points we run into over and over again with CSV exports:

  • We’re currently using Python 2.7 for our Django projects, where the provided CSV library has poor support for Unicode characters. This has to be addressed somehow, usually by UTF-8-encoding Python strings before writing them to the CSV.
  • Adding a BOM character to UTF-8-encoded CSVs is required for them to open properly in Microsoft Excel.

These are delicate behaviors that we prefer to have handled by a dedicated library with adequate unit test coverage.

Utilities

In addition, we found ourselves repeatedly wanting the same helper utilities, and to have them work together consistently and predictably (a combined example follows the list):

  • The ability to generate a filename automatically based on the underlying data.

  • The ability to generate timestamps for each export

  • The ability to generate column headers from the underlying data, sane defaults, custom overrides, or some flexible combination of the three.
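Here’s a sketch of how those utilities combine in a view (parameter names follow the library’s documentation, but treat the exact signature as an assumption):

from djqscsv import render_to_csv_response

def csv_view(request):
    people = Person.objects.values('name', 'info')
    return render_to_csv_response(
        people,
        append_datestamp=True,                     # timestamp the generated filename
        field_header_map={'info': 'Information'})  # override one column header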

In this case, field_header_map takes precedence if provided, followed by verbose_name if specified and not disabled, followed finally by the underlying model field names.

Advanced Usage

Moving this library into production, we quickly discovered some advanced features that we needed the library to support.

Foreign Keys

The most obvious is foreign key fields. This is supported using the .values() method of a queryset, which is able to walk relationships in the same fashion as other ORM directives, using double-underscores. Note that you can’t make use of verbose names in this case, but you can provide custom overrides in the field_header_map argument:
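For example (the hometown relation here is hypothetical):

from djqscsv import render_to_csv_response

def csv_view(request):
    # Walk the foreign key with double-underscores, then relabel the column,
    # since verbose names aren't available for related fields.
    people = Person.objects.values('name', 'hometown__name')
    return render_to_csv_response(
        people, field_header_map={'hometown__name': 'Home Town'})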

Asynchronous Export

Sometimes, you can’t return a CSV response in the same thread. We ran into this problem because sometimes CSV exports, or the queries that produce them, take too long for the request/response cycle, and tying down a web worker for that period is unacceptable. For this case, django-queryset-csv provides a lower-level function for writing CSV data to a file-like object of any kind:
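A sketch of using that lower-level function, write_csv, from (say) a background task:

from djqscsv import write_csv

def export_people(path):
    people = Person.objects.values('name', 'address')
    with open(path, 'wb') as csv_file:
        write_csv(people, csv_file)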

Final Thoughts

We’re using django-queryset-csv in production, but the project remains in beta. We hope it’ll make CSV exports a little less painful for you too. Please report all bugs via github.

GeoTrellis 0.9 is out


The Legendary Island of Avalon (image credit)


The GeoTrellis team is very excited to announce the availability of GeoTrellis 0.9 (codename “Avalon”), a significant new release that is a big step forward towards our goal of a general purpose, high performance raster geoprocessing library and runtime designed to perform and scale for the web.

First of all, we’ve significantly revised the documentation site, http://geotrellis.io/. Props to Rob Emanuele for the new site and the Azavea design team for the revised styling. The new site includes both case studies and some samples we’ve developed since the 0.8 release. There is a full set of release notes available, but here are some highlights:

      • API Refactor: We’re moving away from requiring users to manually create operations and pass in rasters as arguments; instead, objects called ‘DataSources’ represent a source of data, with the operations to transform or combine that data as methods on those objects. These methods are not stateful; they return a new DataSource per method call. Similar to the ‘Future’-like model of Operations, transformations on a DataSource are not actually run until the server is told to ‘run’ the source, either explicitly or through a method call on DataSource and an implicit server parameter. Special thanks to joshmarcus for his vision of this API change and all the work he put into making it happen. This API change also means that any code that currently runs on an 0.8 release will probably be very broken. The ability to create and run Op[T]s is still there, but some of the functionality, especially dealing with parallelizing over tiles, was stripped from them. Let us know on the mailing list or in #geotrellis on Freenode IRC if you’re upgrading and we’ll lend a hand with the transition.
      • File I/O: Reading ARGs from disk has been sped up, and in some cases, you’ll find improvements of an order of magnitude or more.
      • Spray.io:  Replaced Jetty with Spray.io, a fast HTTP server for Akka Actors
      • Tile operation improvements: Running multiple operations over tiled data has been greatly improved. For example, if you were to multiply a raster by an integer, add the result to another tiled raster, and then run a zonal summary (such as taking the mean of values within a polygon) on that result, GeoTrellis 0.8 would unnecessarily reduce to a whole raster in memory between the different transformations (see issue #517). In 0.9, you’ll get the desired behavior, where the multiplication, addition, and zonal summary are all done in parallel per tile, before the final zonal summary result is created from the reduction of the per-tile zonal summary results.
      • Clustering improvements: We took several steps to make it easier to distribute operations over a cluster using Akka clustering. There’s a .distribute call on DataSource which will distribute all of the operations of the DataSource’s elements across the cluster, based on the configuration.
      • Macros: A new geotrellis-macros project was created to deal with issue #624, based on the discussion of #324. This includes macros for checking whether a value is or isn’t NoData, independent of what type that data is. And these checks are inlined through macro magic, so there’s no performance hit for the nicer syntax.
      • Revised Operations: Added double support to the Focal Standard Deviation, Focal Min, Focal Max, and Focal Sum operations; added 8-neighbor connectivity to RegionGroup.
      • New Operations: Ordinary Kriging Interpolation, Hydrology operations (Fill, Flow Accumulations, Flow Direction), IDW Interpolation

Please let the team know — via the #geotrellis channel on Freenode IRC or the geotrellis-user Google Group mailing list — if you have any comments or suggestions.


Version 0.10 Plans

We’re hard at work on GeoTrellis 0.10. The major plans for this release include:

      • Integrate Apache Spark 
      • Support for operating on data stored in the Hadoop Distributed File System (HDFS)
      • Support for multi-band rasters
      • Develop a Scala wrapper for JTS
      • Add more operations
      • Formal move to LocationTech working group at the Eclipse Foundation


More Info


GeoTrellis is released under the Apache 2 license.

python-cicero: A Python Packaging Tutorial

This entry is part 2 of 2 in the series python-cicero and Python Packaging

As mentioned in my last Labs post, last month I released python-cicero, a Python wrapper library for Azavea’s Cicero API. You might recall me mentioning in that post that the Python packaging process is a bit of a mess. You might also remember that the talk I gave about the project at the GeoPhilly Meetup group drew praise from some of the attendees for conveying the man-versus-machine conflict inherent in the packaging journey. If you took a look at those slides, you may have noticed pictures of rubber ducks and horses holding cats, which I shamelessly stole from another fantastic but exasperated-feeling programming talk that accurately captured my sentiments towards Python packaging at the time:

Wat – wæt: n. The only proper response to something that makes absolutely no sense.

We wrote our Python wrapper. We have docs. Even unit tests. One would think we’d be past most of the hurdles that stand between us and shipped Python code, but there is one final harder-than-it-should-be section of the journey to overcome: How do we make it so other people can install and use our wrapper?

The answer is to turn our wrapper into a Python package, and upload it to the Python Package Index. Why is this hard? One issue is the lack of clear, authoritative documentation on the process – part of the reason I’m writing this post. So it should come as no surprise that our first obstacle is one of vocabulary.

Modules, Packages, and Distributions

At first, our wrapper is just a Python “module” – just some .py source files. Modules can be a single .py file, or “multi-file modules” like in our case: several .py files in a directory (here, “cicero”) with a special __init__.py file that tells Python to treat the whole directory as one module. As a multi-file module, users will be able to import everything necessary for the wrapper with a simple “from cicero import *”, or even “from cicero import CiceroRestConnection”.
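The layout looks roughly like this (cicero_response_classes.py is a real file, mentioned again below; the comments are ours):

cicero/
    __init__.py                  # tells Python to treat the directory as one module
    cicero_response_classes.py   # ...plus the wrapper's other .py files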

To make it so others can easily download and install it, with either Python’s “easy_install” command or the far superior “pip”, we have to make our module a proper Python “package” and upload a version of that package (called a “distribution” file) to the Python Package Index.

Those three terms bear repeating. A module is one or many Python source files. A package is one or many modules, plus some supporting files which we’ll get into below. A distribution is an archive file (think tarball) of a package that is uploaded to PyPI and is what your users will actually download from the internet and install with easy_install or pip.

Having gone through this process, I believe the Python community does not take sufficient care to distinguish among these three terms when discussing packaging. Often, Pythonistas will refer to pretty much everything as a “package”. This results in unnecessary confusion and contradiction for newcomers as they try to understand the already messy packaging process. “pip” stands for Pip Installs Packages, when really it’s often downloading/installing distribution files. The Python Package Index is not called the Python Distribution Index, when it probably should be. Folks will refer to a directory of Python files as a package, when they probably really mean a multi-file module.

The Packaging Process

With our terminology settled, what are the “supporting files” I mentioned that go into a package? I’m glad you asked! Here’s a list of the key ones:

  • The modules to be packaged
  • A changelog – CHANGES.txt is the convention
  • A license if the package is open source – LICENSE.txt is the convention
  • A readme file written in reStructuredText, and that’s more than just a convention (see below)
  • A MANIFEST.in file
  • A setup.py file
  • Other non-essential but related files: documentation, example scripts, tests, etc

I’ll assume you know about changelogs and licenses and readme files – if not, they’re easy to find out about, and no specific formatting is required for your package; it’s just “A Good Idea TM” to have them. However, the reason you should write your readme file in reStructuredText if you can is that it will form the basis of your project’s page on PyPI. PyPI will automatically read and format reStructuredText with headings and all that good jazz. You can write your readme file in Markdown or just plain text, but it won’t look as nice.

Finally, we already have our modules, a “docs” folder that Pycco generated with our documentation files, and a “cicero_examples.py” file. So let’s move on to the two files we haven’t encountered yet: MANIFEST.in and setup.py.

MANIFEST.in

Whichever Python packaging utility (more on that in a moment) you use to create your distribution file and submit your software to PyPI will include some files by default – the .py source files it can find, for one. Invariably, however, those will not be the only files you want to include as part of your package and/or distribution! Documentation, the changelog, and example files are all commonly overlooked by the packaging utilities but are in fact critical parts of your finished package and distribution. The MANIFEST.in file’s job is to identify all these extra files to be included. To take python-cicero’s MANIFEST.in as an example:
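Something along these lines (a reconstruction, so the exact file list may differ from python-cicero's):

include README.rst
include LICENSE.txt
include CHANGES.txt
include cicero_examples.py
recursive-include docs *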

You can just put all the files you want to include in your package/distribution in this file, with a preceding “include” statement. If you have a whole directory you want to include, save yourself some typing and use a “recursive-include” statement and asterisk to include all that directory’s files, like I do above for “docs”.

setup.py

This is the real glue that finally puts your package together. It’s actually a short Python program that is run when you first register your package on PyPI, again when you build a distribution file, and finally when you upload that distribution to PyPI. It’s usually pretty simple, with just an import statement to bring in your packaging utility and a call to that packaging utility’s setup() function, with many metadata parameters passed to that function:
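Here’s an abbreviated sketch of such a setup.py (the metadata values are illustrative, not copied from python-cicero):

from setuptools import setup, find_packages

setup(
    name='python-cicero',
    version='0.1.0',  # bump this as you release fixes and features
    description="A Python wrapper for Azavea's Cicero API",
    long_description=open('README.rst').read(),
    url='https://github.com/azavea/python-cicero',
    packages=find_packages(),  # picks up "cicero" and its "test" module
    extras_require={'docs': ['Pycco']},
)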

Sidebar: what’s this “packaging utility” I’ve been referring to? I used a utility called “setuptools.” If you just want to get up and running, I recommend you use setuptools as well. If you’re using pip and virtualenv, you surely already have it in your virtualenv. Unless you have strange edge cases, it will also probably work to package your package. But there are other alternative packaging utilities out there with different edge cases and compatibilities, and this is one of the reasons Python packaging is so confusing. If you see references to other utilities by the names of distutils, distribute, distutils2, or even “bento” – don’t fret. They all accomplish roughly the same thing as setuptools. The first and second answers to this stackoverflow post give a great overview of what all these other utilities are and some of the open source community minutiae reasons why they exist and even why they are merging back with each other. Again, no need to stress over it, and just go with setuptools for now if you can.

Back to setup.py: there are only two setup() parameters that are really essential: “name” and “packages”. “name” tells setuptools what the name of your package is, and “packages” tells setuptools what packages (really, multi-file modules and modules – again with Python’s terminology inconsistency!) are included in the package you’re creating. If you don’t have many packages, you can just list them. If you have a lot, or want a shortcut, you can import and use setuptools’ “find_packages()” function like I did, which searches the directories under setup.py recursively for all Python multi-file modules. In my case, it found both my “cicero” module and my “test” module under it.

All the other parameters I used, while not essential, are really really useful for both listing on PyPI and your users. Let’s go over a few:

  • version - As you fix bugs and add new features, you’ll likely upload and release new versions of your package. So give it a version number!
  • author and maintainer and email fields – You wrote it, give yourself credit! And if you’d like, give your email so your users can contact you with questions.
  • url - your project’s PyPI page is likely not the only or even the best location for information about your package. Put your extra URLs, if you have any, here.
  • description and long_description - Your PyPI listing will be built from these. You can use Python to open and read your README file directly – again, if you wrote it in RST format, your PyPI page will be nicely formatted.
  • extras_require and/or install_requires - Use these if your project has other Python packages as dependencies. In the case of python-cicero, the wrapper itself is implemented entirely with the standard library, so nothing else is required. But if someone anticipates wanting to edit the documentation, they should install Pycco too. And this is what our extras_require entry would allow them to do:
    $ pip install "python-cicero[docs]"

    If you anticipate your users using pip to install your package, then you might also want a requirements.txt file. More information on handling requirements is available here and here.

  • classifiers - PyPI has an extensive list of classifiers for package listings. These are sort of like tags, and will help people find your project and understand a bit about it. Pick a few like a development status, license, and topic from this list exactly as they appear.

The list of options that can go into setup.py is quite extensive; look at the official docs for more but the above is certainly enough to get you started.

Submission to PyPI

We’ve made it to our last step! Our package and all its files are written, and we’re ready to register the project with PyPI and upload a distribution for others.

First, make accounts at both the test PyPI and the real PyPI. Especially for your first time, you’ll want to try this process out on the test site first – it gets cleaned out and reset every so often, so there’s no risk if you mess up. You’ll want to make sure you’ve given your package a name that is not already taken on the real PyPI before you try to upload there, too. Once you take a name on the live PyPI, you’ve taken that name away from other users forever.

Next, create a ~/.pypirc file in your home directory (Windows users – you’ll need to set a HOME environment variable to point to the location of this file):

[distutils]
index-servers =
    test
    pypi

[test]
repository: https://testpypi.python.org/pypi
username: your_pypitest_username
password: your_pypitest_password

[pypi]
repository: https://pypi.python.org/pypi
username: your_pypi_username
password: your_pypi_password

With your login info saved in .pypirc, we have a few simple commands left:

$ python setup.py register -r test

The above should have registered your project with the test PyPI and created a page for it. See if you can get there by going to https://testpypi.python.org/pypi/name_of_your_package. If it worked, now you can build a source distribution file (sdist) and upload it to the test PyPI:

$ python setup.py sdist upload -r test

Look at your package’s test page – is there a tar.gz file listed near the end to download? Great! Now we can do the same process for real:

$ python setup.py register -r pypi
$ python setup.py sdist upload -r pypi

And we’re finally done. Your users should now be able to install your package easily with:

$ pip install your_package
$ #OR
$ easy_install your_package

Overview

Congratulations, you’ve just released some Python software! Now you know about:

  • The differences between a Python module, multi-file module, package, and distribution, and how they’re frequently confused
  • The Python Package Index
  • Creating key files like MANIFEST.in and setup.py which, in addition to Python modules, make up your Python package
  • The steps needed to upload and submit your package to both the PyPI test and PyPI Live instances

If you’re lost or curious, I found these resources incredibly helpful when going through this process for the first time:

Additionally, you can look to the packages Azaveans have contributed to PyPI as examples - django-queryset-csv, python-cicero, and python-omgeo. By all means, pip install them and try them out!

python-cicero: A New Wrapper for the Cicero API

This entry is part 1 of 2 in the series python-cicero and Python Packaging


Last month, I was proud to release our first official language-specific “wrapper” for Cicero, our API for elected official data, district-matching, and geocoding. “python-cicero,” as it’s called, is now available to all on Github or the Python Package Index (also known as PyPI). January also happened to be when the brand new GeoPhilly Meetup group was having its first joint meeting with the Philly Python User’s Group, and I was excited to have such a perfect nexus event, with both Python and GIS nerds in the audience, to give a talk about this project. In the words of one of our attendees, John Ashmead (who also has some background in science fiction writing), I did a good job in my talk of conveying the struggle and conflict between “man and machine” inherent in the process of releasing a Python package.

Yes, it’s sad but true: a certain dose of “man vs machine” conflict is inherent because the state of Python packaging is a total mess and has been for a long time. All newcomers, like myself or my colleague Steve Lamb (with his recently packaged django-queryset-csv project), soon discover this when they embark on distributing their first package, and even CPython core contributors admit it without hesitation. The crooked, winding, poorly documented road to a finished Python package is even more mind boggling when you consider that there are nearly 40,000 of these packages on PyPI. This is not a rare, obscure process. Python packages seem easy at face value.

The packaging process is a lot to cover though, so I’ll be writing a separate tutorial on that and my findings in an upcoming Azavea Labs post later this week. Stay tuned!

Designing a Wrapper

For this post, we’ll examine the wrapper itself, along with another face-value assumption: that API wrappers are “small potatoes” projects. Searching Google or Github for “api wrapper” will give you an idea of how common these things are – and frequently the same API will have duplicate wrappers written in the same language by different authors. And sure, when compared to large software projects like Azavea’s recent Coastal Resilience mapping application, or our veteran Homelessness Analytics visualization site, the 300 KB python-cicero library is tiny.

However, within the relatively small charge of a library intended to make HTTP requests to an API easier, there is a deceptively sizeable level of design considerations to take into account. Netherland points out a few of these in the previous link, particularly around “wrapping” versus “abstraction.” As when designing any software, especially when it’s intended to be used by others at a technical level, you have to think about how your users will use your tool and anticipate their needs and desires. Who uses your API? What for? Are your users technical enough that your wrapper is just saving them repeated calls to “urllib2.urlopen()”? Or would they appreciate some guidance and hand-holding in the form of additional abstraction? The answers to those questions inform the interface you design for your wrapper library. Not the most monumental task, but not the smallest either.

Some of our Cicero API users are very technical, and dive straight into the API. But often, our Cicero API clients come to us from smaller, nonprofit political advocacy groups. Sometimes the people who sign up for Cicero accounts at these organizations have a limited technical background – web development skills they’ve picked up on the side for specific projects here and there. It was this type of user that was in my mind as I designed python-cicero, and why I decided to lean towards more abstraction.

First Contact

Cicero is a paid service, so we’ve implemented a system of authentication to verify that users querying the API have an account in good standing. Users send us their account username and password in the payload of a POST request, and we return back to them an authentication token and numeric user ID that they place in the query string of the URLs for their subsequent calls to the API (which, incidentally, are all GET requests).

In the wrapper, I decided to abstract all of that. We have a class, “CiceroRestConnection”, which is instantiated with a username and password. That’s it! You are now ready to make all your API calls with this new class instance without ever having thought about tokens or POST requests or anything beyond remembering your login details.
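In practice, that looks like this (the credentials are placeholders):

>>> from cicero import CiceroRestConnection
>>> cicero = CiceroRestConnection('your_username', 'your_password')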

Under the hood, the __init__ method of the CiceroRestConnection class takes the username and password info, encodes it into a payload, makes the request to Cicero’s /token/new.json endpoint, parses the token and user ID out of the successful response, and assigns these to class attributes so they’re available for use in other class methods for accessing other API endpoints. Roughly every 24 hours, authentication tokens expire, and Cicero will respond to calls using an expired token with 401 Unauthorized. If necessary, users can build logic into their Python applications to check for this response and, if it’s received, re-instantiate the class (or call __init__ again) to get a fresh token.

Getting Data

Taking our example “cicero” instance from before, we can make a request to the API’s /official endpoint. All endpoints in Cicero aside from requesting new tokens are HTTP GET requests, so I adopted this as my naming scheme for CiceroRestConnection class methods (“get_official()”, “get_nonlegislative_district()”, “get_election_event()”, etc). The user passes however many keyword arguments (all identical to those described in the Cicero API docs) they need to execute their query to the endpoint they’ve chosen (in this case, we kept it simple with one “search_loc” argument to geocode Azavea’s HQ address). The wrapper makes the request, and parses the response into another set of classes that can be easily navigated with Python’s dot notation, all with proper error handling. The user doesn’t have to fiddle with JSON, Python dictionaries, or anything.
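For example (the address and the exact attribute path into the parsed response are illustrative):

>>> response = cicero.get_official(search_loc='340 N 12th St, Philadelphia, PA')
>>> # Navigate the parsed response with dot notation:
>>> response.response.results.candidates[0].officials[0].last_name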

Getting a specific official, district, or election event by its unique ID – in proper ReST fashion – requires placing this numeric ID directly in the root URL, not in the query string as another keyword argument – i.e., /official/123, not /official?id=123. This makes sense to someone familiar with ReST – you’re requesting a specific resource, and that should be part of the Uniform Resource Locator – but it has easily tripped up beginners in the past who expect ID to be just another query string parameter. python-cicero resolves this by having all queries be composed of keyword arguments passed to any of our wrapper methods, including ID. We check for its presence and construct the URL appropriately without burdening the user:
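A simplified sketch of that check (not the exact python-cicero source):

def get_official(self, **kwargs):
    url = self.base_url + '/official'
    if 'id' in kwargs:
        # ReST-style: a specific resource's ID belongs in the URL itself,
        # not in the query string.
        url += '/%s' % kwargs.pop('id')
    # ...the remaining kwargs, plus the token and user ID, become the query string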

Documentation Is Important

A key part of all developer-focused software is having good documentation. You won’t be around to explain how to use it to everyone, so you’d better write that down and write it down clearly. A stalwart in the Python world is the Sphinx system for generating docs. It’s a great tool, but I feel it’s a bit bloated for smaller projects. Also, I don’t like writing in reStructuredText as Sphinx requires and find Markdown to be a bit more intuitive. Furthermore, I personally really appreciate being able to see code alongside my docs, following along in each.

So I was very happy to find a lightweight alternative Python documentation generator, Pycco – a Python port of the Javascript-focused Docco. Pycco lets you write docs as Python docstrings formatted in Markdown:
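For instance (a made-up method, just to show the style):

def get_official(self, **kwargs):
    """
    Queries the Cicero API's `/official` endpoint.

    Accepts the same *keyword arguments* as the API itself, and returns
    a parsed response object.
    """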

Then, run Pycco against your Python source files with one command:

$ pycco cicero/cicero_response_classes.py

And beautiful, fancy-font, syntax-highlighted HTML documentation pages pop out – code on one side, docs on the other. Easy!

Try it Out

If you’d like to give Cicero a try, python-cicero is now one of the easiest ways to do it. Either use Python’s “easy_install” utility or the (superior, if you have it) pip to install the wrapper:

$ easy_install python-cicero
$ #OR
$ pip install python-cicero

Take a look at the docs, available at http://azavea.github.io/python-cicero/, to get a sense of the methods available to you, as well as the “cicero_examples.py” file in the package.

And again, keep an eye out for my upcoming Labs post – we’ll dive in to the more-complex-than-necessary world of creating Python packages and submitting them to the Python Package Index, as I did with python-cicero, with a full tutorial! It should be ready to go this week.