Azavea Labs

Where software engineering meets GIS.

Batch District Matching Using the Cicero API with OpenRefine

OpenRefine (formerly Google Refine) is an awesome open source tool for working with data. If you haven’t heard of it before, in the words of Christopher Groskopf, “Once you’ve clustered and reconciled your crufty public dataset into a glistening gem of normality you won’t know how you lived without it.”

Even if you already have a usable dataset, you might want to add more data to it. This is often why clients come to us for Cicero batch processing and district stamping. Clients give us a spreadsheet of records with street addresses, often a list of supporters or members exported from their CRM system. We then use the expansive database of elected officials and political districts that underpins our Cicero API to run these large batch jobs, geocoding each record and attaching official and district information to it.

However, one of the cool things about OpenRefine is that you can use it yourself to perform similar batch processing tasks with external APIs, like Cicero! In this blog post, we’ll use OpenRefine to add Philadelphia city council district information to an open government dataset of all Charter School locations in the city. Why charter school data? Whether you’re for or against them, there’s no question that charter schools are a tough local political issue being debated by communities across the country. Using OpenRefine and Cicero to determine the council districts of each charter school in Philadelphia would enable us to determine how many charter schools are in each councilmember’s district. That would be useful information to make councilmembers aware of if we were conducting local advocacy work on the merits or drawbacks of this educational approach. With 84 charters in the city, too, this would be a laborious task without OpenRefine!

We’ll start by downloading the zipped CSV file from the School District of Philadelphia’s Open Data Initiative site, which can be found through OpenDataPhilly. We see that the file has a few key fields we’ll be using to interact with Cicero – address, zip code, city and state.

Mmmmm, tabular data.



GeoTrellis Transit on iOS with WhirlyViz

I was recently introduced to Steve Gifford at Mousebird Consulting, a software firm based in San Francisco that builds mapping tools for the iOS platform.  Steve and his colleagues are the developers of the open source iOS mapping framework, WhirlyGlobe Maply.  The framework enables them to build both 2D and 3D mapping applications for iPhones and iPads.  It’s slick, impressive technology that is sort of a combination of the Google Earth globe and a conventional, web-based mapping application.


Mousebird Consulting joined the LocationTech working group at the Eclipse Foundation in March.  LocationTech is a young organization and while there are now several projects moving through the incubation process (GeoTrellis is one of them), there is not yet a lot of coordination or integration between projects.  So I was really excited to see Steve take the initiative to integrate one of our GeoTrellis examples, the GeoTrellis Transit API demo, into Mousebird’s WhirlyViz application. GeoTrellis Transit is an extension of the core GeoTrellis framework.


While the core GeoTrellis is primarily focused on fast, distributed raster data processing, the GT Transit project adds support for fast network routing. It incorporates both GTFS and OpenStreetMap parsing, a high performance network data structure, and support for routing and calculation of time-dependent “travelsheds”, the area a traveler can reach within X minutes.  By “time-dependent”, I mean that GT Transit can calculate transit access areas for a specific time of day and days of week using the schedule information encoded in a GTFS data set.  All of this is wrapped by an API.  When we launched GeoTrellis Transit, we also set up a couple of demos using data for Philadelphia: a travelshed calculator and a “scenic route” demo that shows where you can wander between a starting and ending point and still arrive on time.

The WhirlyViz app has some nice design features.  It’s a native iOS app, but it uses JSON and JavaScript for configuration, and Steve was able to add a new configuration without having to roll out a new application.  Steve picked up the Travelshed API and turned it into a new configuration of the WhirlyViz app.  It’s pretty cool.  In addition to showing the travelsheds, you can set the day-of-week, time-of-day and transit modes.  He wrote up some details in a blog post he published last week.  Here are a few screenshots.

GeoTrellis Transit in WhirlyViz

GeoTrellis Transit uses OpenStreetMap and a GTFS file to enable generation of “travelsheds”. This one shows the walking-distance area around downtown.

GeoTrellis Transit in WhirlyViz

The accessible area changes a great deal when we add access to regional rail.
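To make the travelshed idea above more concrete, here is a minimal sketch in plain Python. It is not how GeoTrellis Transit is implemented: it ignores schedules entirely (so it is not time-dependent), the graph and its travel times are made up, and travelshed is a hypothetical helper name. It only shows the core question, "which nodes can I reach within X minutes?", answered with a bounded shortest-path search.

```python
import heapq


def travelshed(graph, start, max_minutes):
    """Return {node: minutes} for every node reachable within max_minutes.

    graph maps each node to a list of (neighbor, travel_minutes) edges.
    """
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        elapsed, node = heapq.heappop(queue)
        if elapsed > best.get(node, float("inf")):
            continue  # stale queue entry; a faster route was already found
        for neighbor, minutes in graph.get(node, []):
            total = elapsed + minutes
            # Only keep routes that fit the time budget and improve on the best so far
            if total <= max_minutes and total < best.get(neighbor, float("inf")):
                best[neighbor] = total
                heapq.heappush(queue, (total, neighbor))
    return best


# Toy network: node names and travel times are invented for illustration.
toy_graph = {
    "home": [("corner", 4), ("station", 12)],
    "corner": [("station", 5), ("park", 10)],
    "station": [("downtown", 15)],
}
reachable = travelshed(toy_graph, "home", 15)
```

A time-dependent version would replace the fixed edge weights with a lookup that depends on the departure time, which is where the GTFS schedule data comes in.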


Solving Unicode Problems in Python 2.7

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xd1 in position 1: ordinal not in range(128) (Why is this so hard??)

One of the toughest things to get right in a Python program is Unicode handling. If you’re reading this, you’re probably in the middle of discovering this the hard way.

Unicode handling is difficult in Python for two main reasons: the existing terminology is confusing, and many cases which could be problematic are handled transparently. This prevents many people from ever having to learn what’s really going on, until suddenly they run into a brick wall when they want to handle data that contains characters outside the ASCII character set.

If you’ve just run into the Python 2 Unicode brick wall, here are three steps you can take to start thinking about strings and Unicode the right way:

1. str is for bytes, NOT strings

The first step toward solving your Unicode problem is to stop thinking of <type 'str'> as storing strings (that is, sequences of human-readable characters, a.k.a. text). Instead, start thinking of <type 'str'> as a container for bytes. Objects of <type 'str'> are in fact perfectly happy to store arbitrary byte sequences.

To get yourself started, take a look at the string literals in your code. Every time you see 'abc', "abc", or """abc""", say to yourself “That’s a sequence of 3 bytes corresponding to the ASCII codes for the letters a, b, and c” (technically, it’s UTF-8, but ASCII and UTF-8 are the same for Latin letters).

2. unicode is for strings

The second step toward solving your problem is to start using <type 'unicode'> as your go-to container for strings.

For starters, that means using the “u” prefix for literals, which creates objects of <type 'unicode'>, rather than unprefixed quotes, which create objects of <type 'str'> (don’t bother with the docstrings; you’ll rarely have to manipulate them yourself, which is where problems usually happen). There are some other good practices which I’ll discuss below.

3. UTF-8, UTF-16, and UTF-32 are serialization formats — NOT Unicode

UTF-8 is an encoding, just like ASCII (more on encodings below), which is represented with bytes. The difference is that the UTF-8 encoding can represent every Unicode character, while the ASCII encoding can’t. But they’re both still bytes. By contrast, an object of <type 'unicode'> is just that: a Unicode object. It isn’t encoded or represented by any particular sequence of bytes. You can think of Unicode objects as storing abstract, Platonic representations of text, while ASCII, UTF-8, UTF-16, etc. are different ways of serializing (encoding) your text.
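As a small illustration of the distinction, the sketch below serializes one Unicode object three different ways and decodes each back to the same text. It is written to run on both Python 2.7 and Python 3; the variable names are only for illustration, and the explicit little-endian codecs are chosen so that no byte order mark is added.

```python
text = u'Jos\xe9'  # one Unicode object, four characters

# Three different serializations of the same text
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')
utf32_bytes = text.encode('utf-32-le')

# The byte sequences differ in length as well as content
byte_lengths = (len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))

# Decoding each with the matching codec gives back the same Unicode object
round_trips = (
    utf8_bytes.decode('utf-8') == text
    and utf16_bytes.decode('utf-16-le') == text
    and utf32_bytes.decode('utf-32-le') == text
)
```

The text itself never changes; only its byte-level representation does.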

Okay, but why can’t I use str for strings? (Detailed problem description)

The reason for going through the mind-shift above is that since <type 'str'> stores bytes, it has an implicit encoding, and encodings (and/or attempts to decode the wrong encoding) cause the majority of Unicode problems in Python 2.

What do I mean by encoding? It’s the sequence of bits used to represent the characters that we read. That is, the “abc” string from above is actually being stored like this: 01100001 01100010 01100011.

But there are other ways to store “abc” — if you store it in UTF-8, it looks exactly like the ASCII version because UTF-8 and ASCII are the same for Latin letters. But if you store “abc” in UTF-16, you get 0000000001100001 0000000001100010 0000000001100011.
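You can check these bit patterns yourself. The helper below, bit_string, is a hypothetical name introduced for this sketch; it runs on Python 2.7 and Python 3. The 'utf-16-be' codec is used because it matches the byte order shown above; the plain 'utf-16' codec would also prepend a byte order mark.

```python
def bit_string(text, encoding):
    """Encode text and render the resulting bytes as groups of 8 bits."""
    encoded = bytearray(text.encode(encoding))  # bytearray yields ints on 2.7 and 3
    return ' '.join(format(byte, '08b') for byte in encoded)


ascii_bits = bit_string(u'abc', 'ascii')
utf8_bits = bit_string(u'abc', 'utf-8')
utf16_bits = bit_string(u'abc', 'utf-16-be')
```

Here ascii_bits and utf8_bits are identical, while utf16_bits has an extra all-zero byte in front of each letter.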

Encodings are important because you have to use them whenever text travels outside the bounds of your program–if you want to write a string to a file, or send it over a network, or store it in a database, it needs to have an encoding. And if you send out the wrong encoding (that is, a byte sequence that your receiver doesn’t expect), you’ll get Unicode errors.

The problem with <type 'str'>, and the main reason why Unicode in Python 2.7 is confusing, is that the encoding of a given instance of <type 'str'> is implicit. This means that the only way to discover the encoding of a given instance of <type 'str'> is to try and decode the byte sequence, and see if it explodes. Unfortunately, there are lots of places where byte sequences get invisibly decoded, which can cause confusion and problems. Here are some example lines to demonstrate:

# Set up the variables we'll use
>>> uni_greeting = u'Hi, my name is %s.'
>>> utf8_greeting = uni_greeting.encode('utf-8')

>>> uni_name = u'José'  # Note the accented e.
>>> utf8_name = uni_name.encode('utf-8')

# Plugging a Unicode into another Unicode works fine
>>> uni_greeting % uni_name
u'Hi, my name is Jos\xe9.'

# Plugging UTF-8 into another UTF-8 string works too
>>> utf8_greeting % utf8_name
'Hi, my name is Jos\xc3\xa9.'

# You can plug Unicode into a UTF-8 byte sequence...
>>> utf8_greeting % uni_name  # UTF-8 invisibly decoded into Unicode; note the return type
u'Hi, my name is Jos\xe9.'

# But plugging a UTF-8 string into a Unicode doesn't work so well...
>>> uni_greeting % utf8_name  # Invisible decoding doesn't work in this direction.
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

# Unless you plug in ASCII-compatible data, that is.
>>> uni_greeting % u'Bob'.encode('utf-8')
u'Hi, my name is Bob.'

# And you can forget about string interpolation completely if you're using UTF-16.
>>> uni_greeting.encode('utf-16') % uni_name
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: unsupported format character '' (0x0) at index 33

# Well, you can interpolate utf-16 into utf-8 because these are just byte sequences
>>> utf8_greeting % uni_name.encode('utf-16')  # But this is a useless mess
'Hi, my name is \xff\xfeJ\x00o\x00s\x00\xe9\x00.'

The examples above should show you why using <type 'str'> is problematic; invisible decoding coupled with the implicit encodings for <type 'str'> can hide serious problems. Everything will work just fine as long as your code handles strictly ASCII data. Then, one day, a hapless “é” will blunder into your input. Code which implicitly assumes (and invisibly decodes) ASCII-encoded input will suddenly have to contend with UTF-8-encoded data, and the whole thing can blow up; even your exception handlers may start throwing UnicodeDecodeErrors.

Solution: The Unicode ‘airlock’

The best way to attack the problem, as with many things in Python, is to be explicit. That means that every string that your code handles needs to be clearly treated as either Unicode or a byte sequence.

The most systematic way to accomplish this is to make your code into a Unicode-only clean room. That is, your code should only use Unicode objects internally; you may even want to put checks for <type 'unicode'> in key places to keep yourself honest.
Then, put ‘airlocks’ at the entry points to your code which will ensure that any byte sequence attempting to enter your code is properly clothed in a protective Unicode bunny suit before being allowed inside.

For example:

f = open('file.txt')  # BAD--gives you bytes
f = io.open('file.txt', encoding='utf-8')  # GOOD--gives you Unicode (requires "import io")

This might sound slow and cumbersome, but it’s actually pretty easy; most well-known Python libraries follow this practice already, so you usually only need to worry about input coming from files, network requests, etc.
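One way to build such an airlock is a small helper that every entry point calls before handing data to the rest of the program. The sketch below is only one possible shape, written to run on Python 2.7 and Python 3; to_unicode and greet are hypothetical names, and UTF-8 as the default input encoding is an assumption you would adjust to your own data sources.

```python
text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_unicode(value, encoding='utf-8'):
    """Airlock: return value as a Unicode object, decoding bytes if needed."""
    if isinstance(value, text_type):
        return value  # already clean, let it through unchanged
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)  # decode explicitly, never implicitly
    raise TypeError('expected text or bytes, got %r' % type(value))


def greet(raw_name):
    name = to_unicode(raw_name)  # decode once, at the boundary
    return u'Hi, my name is %s.' % name


from_bytes = greet(u'Jos\xe9'.encode('utf-8'))
from_text = greet(u'Jos\xe9')
```

Because the decoding happens in exactly one place, a wrong encoding fails loudly at the boundary instead of deep inside unrelated code.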

Airlock Construction Kit (Useful Unicode tools)

Nearly every Unicode problem can be solved by the proper application of these tools; they will help you build an airlock to keep the inside of your code nice and clean:

  • encode(): Gets you from Unicode -> bytes
  • decode(): Gets you from bytes -> Unicode
  • io.open('file.txt', encoding='utf-8'): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common). codecs.open() accepts the same encoding argument.
  • u'': Makes your string literals into Unicode objects rather than byte sequences.

Warning: Don’t use encode() on bytes or decode() on Unicode objects.


The key to troubleshooting Unicode errors in Python is to know what types you have. Then, try these steps:

  1. If some variables are byte sequences instead of Unicode objects, convert them to Unicode objects with decode() / u'' before handling them.

    >>> uni_greeting % utf8_name
    Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
    # Solution:
    >>> uni_greeting % utf8_name.decode('utf-8')
    u'Hi, my name is Jos\xe9.'
  2. If all variables are byte sequences, there is probably an encoding mismatch; convert everything to Unicode objects with decode() / u'' and try again.

  3. If all variables are already Unicode, then part of your code may not know how to deal with Unicode objects; either fix the code, or encode to a byte sequence before sending the data (and make sure to decode any return values back to Unicode):

    >>> with open('test.out', 'wb') as f:
    ...     f.write(uni_name)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
    # Solution: encode explicitly before writing bytes
    >>> with open('test.out', 'wb') as f:
    ...     f.write(uni_name.encode('utf-8'))
    # Better Solution: let the file object do the encoding
    >>> import io
    >>> with io.open('test.out', 'w', encoding='utf-8') as f:
    ...     f.write(uni_name)

Other points

Python 3 solves this problem by becoming more explicit: string literals are now Unicode by default, while byte sequences are stored in a new type called ‘bytes’.
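A short sketch of what that looks like in practice. This block is for Python 3 only, unlike the Python 2.7 sessions earlier in the post, and the variable names are just for illustration.

```python
# Python 3 only
text = 'Jos\u00e9'          # plain literals are Unicode text (str)
data = text.encode('utf-8')  # encoding produces a separate bytes object

kinds = (type(text).__name__, type(data).__name__)

# Mixing the two types is an immediate error instead of an invisible decode
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

The implicit ASCII decoding that caused the surprises above is gone; you have to call encode() or decode() yourself.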

For a much more thorough look at these issues, take a look at .

Good luck!

Exporting Django Querysets to CSV

At Azavea, we have numerous client projects that must provide exportable data in CSV format. The reasons for this range from simple data exchange to complex export-modify-import workflows. In order to make this process easier for django projects, we made a simple utility, django-queryset-csv, for exporting django querysets directly to HTTP responses with CSVs attached.


So you have something like this:


pip install django-queryset-csv

Pain Points

Why bother? Can’t you write a for-loop to export a CSV in a dozen or so lines of code? It turns out there are a few pain points we run into over and over again with CSV exports:

  • We’re currently using python 2.7 for our django projects, where the provided CSV library has poor support for unicode characters. This has to be addressed somehow, usually by utf-8 encoding python strings before writing them to the CSV.
  • Adding a BOM character to CSVs with utf8 encoding is required for them to open properly in Microsoft Excel.

These are delicate behaviors that we prefer to have handled by a dedicated library with adequate unit test coverage.
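To show what those two behaviors amount to, here is a standard-library sketch. It is not django-queryset-csv's code, and it is written for Python 3, where the csv module works with text directly; on Python 2.7 each value would have to be UTF-8 encoded before being handed to the writer, as described above. rows_to_excel_csv is a hypothetical helper name.

```python
import codecs
import csv
import io


def rows_to_excel_csv(header, rows):
    """Return CSV content as UTF-8 bytes, prefixed with a byte order mark."""
    buffer = io.StringIO(newline='')  # let the csv module control line endings
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    # The leading BOM is the hint Excel uses to read the file as UTF-8
    return codecs.BOM_UTF8 + buffer.getvalue().encode('utf-8')


payload = rows_to_excel_csv(['name', 'city'], [['Jos\u00e9', 'Philadelphia']])
```

The resulting bytes can be written to a file or sent as a response body as-is.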


In addition, we found ourselves repeatedly wanting the same helper utilities, and to have them work together consistently and predictably.

  • The ability to generate a filename automatically based on the underlying data.

  • The ability to generate timestamps for each export

  • The ability to generate column headers from the underlying data, sane defaults, custom overrides, or some flexible combination of the three.

    In this case, field_header_map takes precedence if provided, followed by verbose_name if specified and not disabled, followed finally by the underlying model field names.
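The precedence rule can be sketched in a few lines of plain Python. This is an illustration of the rule as described, not the library's implementation: resolve_headers and its use_verbose_names flag are hypothetical, and the verbose names are passed in as a plain dict rather than read from Django model fields.

```python
def resolve_headers(field_names, verbose_names=None, field_header_map=None,
                    use_verbose_names=True):
    """Pick a column header for each field, most specific source first."""
    verbose_names = verbose_names or {}
    field_header_map = field_header_map or {}
    headers = []
    for name in field_names:
        if name in field_header_map:
            headers.append(field_header_map[name])   # 1. custom override
        elif use_verbose_names and name in verbose_names:
            headers.append(verbose_names[name])      # 2. verbose name
        else:
            headers.append(name)                     # 3. raw field name
    return headers


headers = resolve_headers(
    ['id', 'first_name', 'zip'],
    verbose_names={'first_name': 'First name', 'zip': 'ZIP code'},
    field_header_map={'zip': 'Postal code'},
)
plain_headers = resolve_headers(
    ['id', 'first_name', 'zip'],
    verbose_names={'first_name': 'First name', 'zip': 'ZIP code'},
    field_header_map={'zip': 'Postal code'},
    use_verbose_names=False,
)
```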

Advanced Usage

Moving this library into production, we quickly discovered some advanced features that we needed the library to support.

Foreign Keys

The most obvious is foreign key fields. This is supported using the .values() method of a queryset, which is able to walk relationships in the same fashion as other ORM directives, using double-underscores. Note that you can’t make use of verbose names in this case, but you can provide custom overrides in the field_header_map argument:

Asynchronous Export

Sometimes, you can’t return a CSV response in the same thread. We ran into this problem because sometimes CSV exports, or the queries that produce them, take too long for the request/response cycle, and tying down a web worker for that period is unacceptable. For this case, django-queryset-csv provides a lower-level function for writing CSV data to a file-like object of any kind:
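The general idea of targeting any file-like object can be sketched with the standard library alone. The block below is not django-queryset-csv's function: write_rows_as_csv is a hypothetical stand-in that takes plain dicts instead of a queryset, and it is written for Python 3. The point is that the writer only needs something with a write() method, so a background task can hand it a file on disk, an in-memory buffer, or an upload stream.

```python
import csv
import io


def write_rows_as_csv(rows, fieldnames, file_obj):
    """Write dict rows as CSV to any object that has a write() method."""
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer stands in for whatever the background job writes to
buffer = io.StringIO(newline='')
write_rows_as_csv(
    [{'id': 1, 'name': 'Ada'}, {'id': 2, 'name': 'Grace'}],
    ['id', 'name'],
    buffer,
)
csv_text = buffer.getvalue()
```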

Final Thoughts

We’re using django-queryset-csv in production, but the project remains in beta. We hope it’ll make CSV exports a little less painful for you too. Please report all bugs via github.

GeoTrellis 0.9 is out


The Legendary Island of Avalon (image credit)


The GeoTrellis team is very excited to announce the availability of GeoTrellis 0.9 (codename “Avalon”), a significant new release that is a big step forward towards our goal of a general purpose, high performance raster geoprocessing library and runtime designed to perform and scale for the web.

First of all, we’ve significantly revised the documentation site.  Props to Rob Emanuele for the new site and the Azavea design team for revised styling.  The new site includes both case studies and some samples we’ve developed since the 0.8 release.  There is a full set of release notes available, but here are some highlights:

      • API Refactor: We’re moving away from requiring users to manually create operations and pass in rasters as arguments, and instead having objects called ‘DataSources’ that represent the source of data, with the operations to transform or combine that data as methods on those objects. These methods are not stateful; they return new DataSources per method call. Similar to the ‘Future’-like model of Operations, transformations on a DataSource are not actually run until the server is told to ‘run’ the source, either explicitly or through a method call on DataSource and an implicit server parameter. Special thanks to joshmarcus for his vision of this API change and all the work he put into making it happen. This API change also means that any code that currently runs on a 0.8 release will probably be very broken. The ability to create and run Op[T]’s is still there, but some of the functionality, especially dealing with parallelizing over tiles, was stripped from them. Let us know on the mailing list or in #geotrellis on freenode IRC if you’re upgrading and we’ll lend a hand with the transition.
      • File I/O: Reading ARGs from disk has been sped up, and in some cases, you’ll find improvements of an order of magnitude or more.
      •  Replaced Jetty with, a fast HTTP server for Akka Actors
      • Tile operation improvements: Running multiple operations over tiled data has been greatly improved. For example, if you were to multiply a raster by an integer, add the result to another tiled raster, and then run a zonal summary (such as taking the mean of values within a polygon) on that result, GeoTrellis 0.8 would unnecessarily reduce to a whole raster in memory between the different transformations (see issue #517). In 0.9, you’ll get the desired behavior, where the multiplication, addition, and zonal summary are all done in parallel per tile, before the final zonal summary result is created from the reduction of the tile zonal summary results.
      • Clustering improvements: We took several steps to make it easier to distribute operations over a cluster using Akka clustering. There’s a .distribute call on DataSource which will distribute all of the operations of the DataSource’s elements across the cluster, based on the configuration.
      • Macros: A new geotrellis-macros project was created to deal with issue #624, based on the discussion of #324. This includes macros for checking whether a value is or isn’t NoData, independent of what type that data is. And these checks are inlined through macro magic, so there’s no performance hit for the nicer syntax.
      • Revised Operations: Added double support to Focal Standard Deviation, Focal Min, Focal Max, Focal Sum operations; added 8-neighbor connectivity to RegionGroup.
      • New Operations: Ordinary Kriging Interpolation, Hydrology operations (Fill, Flow Accumulations, Flow Direction), IDW Interpolation
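The per-tile reduction described under “Tile operation improvements” can be illustrated with a toy example. The sketch below is plain Python, not the GeoTrellis Scala API: tiles are nested lists, the zone is a boolean mask per tile, and tile_summary and zonal_mean are hypothetical names. Each tile produces a small partial result (sum and count), and only those partials are combined at the end, so the whole raster never needs to be assembled in memory.

```python
def tile_summary(tile, zone_mask):
    """Return (sum, count) of the tile cells that fall inside the zone."""
    total, count = 0.0, 0
    for row, mask_row in zip(tile, zone_mask):
        for value, inside in zip(row, mask_row):
            if inside:
                total += value
                count += 1
    return total, count


def zonal_mean(tiles, masks):
    """Combine per-tile partial results into a single zonal mean."""
    # Each tile_summary call is independent, so this step could run in parallel
    partials = [tile_summary(tile, mask) for tile, mask in zip(tiles, masks)]
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total / count if count else None


# Two made-up 2x2 tiles and the zone mask for each
tiles = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]
masks = [[[True, False], [True, True]], [[False, False], [True, False]]]
mean = zonal_mean(tiles, masks)
```

Carrying (sum, count) rather than a per-tile mean is what makes the final reduction exact regardless of how many zone cells each tile contains.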

Please let the team know — via the #geotrellis channel on Freenode IRC or the geotrellis-user Google Group mailing list — if you have any comments or suggestions.


Version 0.10 Plans

We’re hard at work on GeoTrellis 0.10.  The major plans for this release include:

      • Integrate Apache Spark 
      • Support for operating on data stored in the Hadoop Distributed File System (HDFS)
      • Support for multi-band rasters
      • Develop a scala wrapper for JTS
      • Add more operations
      • Formal move to LocationTech working group at the Eclipse Foundation


More Info


GeoTrellis is released under the Apache2 license.