Azavea Labs

Where software engineering meets GIS.

python-cicero: A New Wrapper for the Cicero API

This entry is part 1 of 2 in the series python-cicero and Python Packaging


Last month, I was proud to release our first official language-specific “wrapper” for Cicero, our API for elected official data, district matching, and geocoding. “python-cicero,” as it’s called, is now available to all on Github or the Python Package Index (also known as PyPI). January also happened to be when the brand new GeoPhilly Meetup group was having its first joint meeting with the Philly Python User’s Group, and I was excited to have such a perfect nexus event with both Python and GIS nerds in the audience to give a talk about this project. In the words of one of our attendees, John Ashmead (who also has some background in science fiction writing), I did a good job in my talk of conveying the struggle and conflict between “man and machine” inherent in the process of releasing a Python package.

Yes, it’s sad but true: a certain dose of “man vs machine” conflict is inherent because the state of Python packaging is a total mess and has been for a long time. All newcomers, like myself or my colleague Steve Lamb (with his recently packaged django-queryset-csv project), soon discover this when they embark on distributing their first package, and even CPython core contributors admit it without hesitation. The crooked, winding, poorly documented road to a finished Python package is even more mind boggling when you consider that there are nearly 40,000 of these packages on PyPI. This is not a rare, obscure process. Python packages seem easy at face value.

The packaging process is a lot to cover though, so I’ll be writing a separate tutorial on that and my findings in an upcoming Azavea Labs post later this week. Stay tuned!

Designing a Wrapper

For this post, we’ll examine the wrapper itself, along with another face-value assumption: that API wrappers are “small potatoes” projects. Searching Google or Github for “api wrapper” will give you an idea of how common these things are – and frequently the same API will have duplicate wrappers written in the same language by different authors. And sure, when compared to large software projects like Azavea’s recent Coastal Resilience mapping application, or our veteran Homelessness Analytics visualization site, the 300 KB python-cicero library is tiny.

However, within the relatively small charge of a library intended to make HTTP requests to an API easier, there is a deceptively sizeable number of design considerations to take into account. Netherland points out a few of these in the previous link, particularly around “wrapping” versus “abstraction.” As when designing all software, especially when it’s intended to be used by others at a technical level, you have to think about how your users will use your tool and anticipate their needs and desires. Who uses your API? What for? Are your users technical enough that your wrapper is just saving them repeated calls to “urllib2.urlopen()”? Or would they appreciate some guidance and hand-holding in the form of additional abstraction? The answers to those questions inform the interface you design to your wrapper library. Not the most monumental task, but not the smallest either.

Some of our Cicero API users are very technical, and dive straight into the API. But often, our Cicero API clients come to us from smaller, nonprofit political advocacy groups. Sometimes the people who sign up for Cicero accounts at these organizations have a limited technical background – web development skills they’ve picked up on the side for specific projects here and there. It was this type of user that was in my mind as I designed python-cicero, and why I decided to lean towards more abstraction.

First Contact

Cicero is a paid service, so we’ve implemented a system of authentication to verify that users querying the API have an account in good standing. Users send us their account username and password in the payload of a POST request, and we return an authentication token and numeric user ID that they place in the query string of the URLs for their subsequent calls to the API (which, incidentally, are all GET requests).

In the wrapper, I decided to abstract all of that. We have a class, “CiceroRestConnection”, which is instantiated with a username and password. That’s it! You are now ready to make all your API calls with this new class instance without ever having thought about tokens or POST requests or anything beyond remembering your login details.

Under the hood, the __init__ method of the CiceroRestConnection class takes the username and password info, encodes it into a payload, makes the request to Cicero’s /token/new.json endpoint, parses the token and user ID out of the successful response, and assigns these to class attributes so they’re available for use in other class methods for accessing other API endpoints. Roughly every 24 hours, authentication tokens will expire, and Cicero will respond to future calls using the expired token with 401 Unauthorized. If necessary, users can build logic into their Python applications to check for this response and, if it is received, either call __init__ again to reset their token or re-instantiate the class.
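The token handshake described above can be sketched roughly as follows. This is a minimal illustration and not python-cicero's actual source: the base URL, the "token" and "user" response keys, the query parameter names, and the injected http_post callable are all assumptions, chosen so the example runs without a network connection.

```python
import json
from urllib.parse import urlencode

# Assumed base URL, for illustration only.
BASE_URL = "https://cicero.azavea.com/v3.1"


class SketchConnection:
    """Illustrative stand-in for CiceroRestConnection (not the library's code)."""

    def __init__(self, username, password, http_post):
        # http_post is a hypothetical callable: (url, body_bytes) -> JSON text.
        self._credentials = (username, password)
        self._http_post = http_post
        self.authenticate()

    def authenticate(self):
        """POST the credentials and store the returned token and user ID."""
        username, password = self._credentials
        payload = urlencode({"username": username, "password": password})
        body = self._http_post(BASE_URL + "/token/new.json", payload.encode("utf-8"))
        data = json.loads(body)
        # Assumed response keys: "token" and "user".
        self.token = data["token"]
        self.user_id = data["user"]

    def auth_params(self):
        """Query-string parameters to attach to every later GET request."""
        return {"user": self.user_id, "token": self.token}


def fake_post(url, body):
    """Stand-in for a real HTTP POST so the sketch runs offline."""
    return json.dumps({"success": True, "token": "abc123", "user": 42})


conn = SketchConnection("example_user", "example_password", fake_post)
print(urlencode(conn.auth_params()))
```

Keeping authentication in its own method means an application that gets a 401 Unauthorized after a token expires could simply call authenticate() again instead of rebuilding the whole object.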

Getting Data

Taking our example “cicero” instance from before, we can make a request to the API’s /official endpoint. All endpoints in Cicero aside from requesting new tokens are HTTP GET requests, so I adopted this as my naming scheme for CiceroRestConnection class methods (“get_official()”, “get_nonlegislative_district()”, “get_election_event()”, etc). The user passes however many keyword arguments (all identical to those described in the Cicero API docs) they need to execute their query to the endpoint they’ve chosen (in this case, we kept it simple with one “search_loc” argument to geocode Azavea’s HQ address). The wrapper makes the request, and parses the response into another set of classes that can be easily navigated with Python’s dot notation, all with proper error handling. The user doesn’t have to fiddle with JSON, Python dictionaries, or anything.
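To give a feel for what "dot notation" navigation looks like, here is a small sketch. python-cicero uses its own dedicated response classes; this example swaps in the standard library's SimpleNamespace as a generic substitute, and the sample JSON is invented for illustration rather than copied from a real Cicero response.

```python
import json
from types import SimpleNamespace


def to_objects(json_text):
    """Parse JSON so every object becomes an attribute-accessible namespace."""
    return json.loads(json_text, object_hook=lambda d: SimpleNamespace(**d))


# Invented sample data; real Cicero responses contain many more fields.
sample = """
{"response": {"results": {"officials": [
    {"first_name": "Jane", "last_name": "Doe", "office": {"title": "Mayor"}}
]}}}
"""

parsed = to_objects(sample)
official = parsed.response.results.officials[0]
print(official.last_name, official.office.title)
```

Dedicated classes, as the wrapper uses, have the advantage of documenting which attributes exist; the generic approach here only shows the navigation style.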

Getting a specific official, district, or election event by its unique ID – in proper ReST fashion – requires placing this numeric ID directly in the root URL, not the query string as another keyword argument – i.e., /official/123, not /official?id=123. This makes sense to someone familiar with ReST – you’re requesting a specific resource, and that should be part of the Uniform Resource Locator – but has easily tripped up beginners in the past who expect ID to be just another query string parameter. python-cicero resolves this by having all queries be composed of keyword arguments passed to any of our wrapper methods, including ID. We check for its presence and construct the URL appropriately without burdening the user.
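The ID check might look something like the sketch below. It is an assumption-laden illustration, not the wrapper's real code: build_url is a hypothetical helper, the base URL is assumed, and the address is a made-up example.

```python
from urllib.parse import urlencode

# Assumed base URL, for illustration only.
BASE_URL = "https://cicero.azavea.com/v3.1"


def build_url(endpoint, **kwargs):
    """Build a request URL, moving an 'id' keyword into the path if present."""
    params = dict(kwargs)
    resource_id = params.pop("id", None)
    url = "{}/{}".format(BASE_URL, endpoint)
    if resource_id is not None:
        # A specific resource is addressed by its path: /official/123
        url += "/{}".format(resource_id)
    if params:
        # Everything else stays in the query string.
        url += "?" + urlencode(params)
    return url


print(build_url("official", id=123))
print(build_url("official", search_loc="123 Main St"))
```

The caller passes id like any other keyword argument and never needs to know that it ends up in a different part of the URL.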

Documentation Is Important

A key part of all developer-focused software is having good documentation. You won’t be around to explain how to use it to everyone, so you’d better write that down and write it down clearly. A stalwart in the Python world is the Sphinx system for generating docs. It’s a great tool, but I feel it’s a bit bloated for smaller projects. Also, I don’t like writing in reStructuredText as Sphinx requires and find Markdown to be a bit more intuitive. Furthermore, I personally really appreciate being able to see code alongside my docs, following along in each.

So I was very happy to find a lightweight alternative Python documentation generator, Pycco – a Python port of the Javascript-focused Docco. Pycco lets you write docs as Python docstrings formatted in Markdown.
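As an illustration of the style, here is a small class with Markdown-formatted docstrings. ExampleOfficial is a hypothetical class written only for this example; it is not taken from python-cicero.

```python
class ExampleOfficial:
    """
    ## ExampleOfficial

    A **hypothetical** response class, written only to show the docstring style.

    * `first_name`: the official's given name
    * `last_name`: the official's family name
    """

    def __init__(self, data):
        # Plain `#` comments are also rendered as prose next to the code.
        self.first_name = data["first_name"]
        self.last_name = data["last_name"]

    def full_name(self):
        """Return the *display name* as a single string."""
        return "{} {}".format(self.first_name, self.last_name)


print(ExampleOfficial({"first_name": "Jane", "last_name": "Doe"}).full_name())
```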

Then, run Pycco against your Python source files with one command:

$ pycco cicero/cicero_response_classes.py

And beautiful, fancy font, syntax highlighted HTML documentation pages pop out – code on one side, docs on the other. Easy!

Try it Out

If you’d like to give Cicero a try, python-cicero is now one of the easiest ways to do it. Either use Python’s “easy_install” utility or the (superior, if you have it) pip to install the wrapper:

$ easy_install python-cicero
$ #OR
$ pip install python-cicero

Take a look at the docs, available at http://azavea.github.io/python-cicero/ to get a sense of the methods available to you, as well as the “cicero_examples.py” file in the package.

And again, keep an eye out for my upcoming Labs post – we’ll dive in to the more-complex-than-necessary world of creating Python packages and submitting them to the Python Package Index, as I did with python-cicero, with a full tutorial! It should be ready to go this week.

Securing GIS applications with SSL and HSTS

In building the new version of our HunchLab product for crime forecasting we are very concerned about security.  Police departments work with data sets that not only require privacy but can also expose individuals to harm if the data is disclosed publicly.  As application developers we often focus on security within the data center.  While making sure that our application properly handles SQL injection attacks and maintains the proper firewalls is critical, the weakest security component is often the client themselves.  Here are a few tips for improving your application’s security outside of the data center.

SSL is not so simple

When we browse the web, we know that when we see the lock icon in the browser tab our communication is encrypted and therefore “secure”.  But not all TLS/SSL is actually secure.  In researching the topic for our development work, I encountered quite a scary reality.  More than 20% of websites support a version of the SSL protocol that is completely broken (SSL v2).  So what to do?

TLS 1.0 is supported by nearly all browsers — a notable exception is Internet Explorer 6.  If you can live without IE6 support (and I pray that you can) then there is no reason to support SSL version 2 or 3 in your application.  You should also enable TLS 1.1 and 1.2 for browsers that support these newer standards.

If you are operating a highly sensitive application and can control which browsers you support, then disabling TLS 1.0 and selecting the right options can also improve security.  I found the cipher chart and browser support charts on Wikipedia to be immensely helpful.  For HunchLab 2.0 we decided to only support TLS 1.1 and 1.2 since we are dealing with sensitive data.

These newer versions of TLS give you the option of supporting ciphers with forward secrecy.  For instance, assume you have typical SSL settings turned on and the NSA logs all of the traffic to your website.  Perhaps they can’t decrypt the traffic today, but if at any time in the future they get a copy of your private SSL certificate key, then they can decrypt all historic traffic.  Forward secrecy prevents this issue by generating ephemeral keys for each connection.  Look for settings for ephemeral Diffie-Hellman (DHE) and elliptic curve Diffie–Hellman (ECDHE).
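These settings normally live in your web server or load balancer configuration, but the same idea can be sketched with Python's standard ssl module. This is an illustration under stated assumptions rather than HunchLab's configuration: it uses TLS 1.2 as the floor (stricter than the TLS 1.1 minimum mentioned above), and the cipher string is just one reasonable choice that keeps only ECDHE and DHE key exchange.

```python
import ssl

# Server-side context; the certificate and key would be loaded separately
# with context.load_cert_chain(...) before serving traffic.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)

# Refuse anything older than TLS 1.2 (an assumed floor for this sketch).
context.minimum_version = ssl.TLSVersion.TLSv1_2

# Allow only ephemeral key exchange (ECDHE, DHE) with AES-GCM,
# written in OpenSSL cipher-string syntax.
context.set_ciphers("ECDHE+AESGCM:DHE+AESGCM")

# List what the context will actually offer on this machine.
for cipher in context.get_ciphers():
    print(cipher["protocol"], cipher["name"])
```

The exact cipher names printed depend on the OpenSSL build behind your Python installation, so treat the list as something to inspect rather than something to hard-code.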

A great way to learn about these settings and see how your application does is to use this SSL Tester.

HTTP Strict Transport Security

Another way to improve security is to tell the user’s web browser to only use encrypted connections.  This is really easy to do via the HTTP Strict Transport Security (HSTS) response headers.  If you support SSL within your application for all URLs then there is no reason to wait to do this.  It’s a simple header.  Chrome, Firefox, and Opera currently support this standard.
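To show just how simple the header is, here is a sketch of a small WSGI middleware that adds it to every response. The one-year max-age and the includeSubDomains directive are assumptions to adjust for your own site; hsts_middleware and hello_app are hypothetical names used only for this example.

```python
def hsts_middleware(app, max_age=31536000, include_subdomains=True):
    """Wrap a WSGI app so every response carries an HSTS header."""
    value = "max-age={}".format(max_age)
    if include_subdomains:
        value += "; includeSubDomains"

    def wrapped(environ, start_response):
        def start_with_hsts(status, headers, exc_info=None):
            # Drop any existing HSTS header, then add ours.
            headers = [h for h in headers
                       if h[0].lower() != "strict-transport-security"]
            headers.append(("Strict-Transport-Security", value))
            return start_response(status, headers, exc_info)
        return app(environ, start_with_hsts)

    return wrapped


def hello_app(environ, start_response):
    """Tiny example application to wrap."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


# Call the wrapped app directly with a recording start_response.
captured = {}

def record(status, headers, exc_info=None):
    captured["headers"] = headers

body = hsts_middleware(hello_app)({}, record)
print(captured["headers"])
```

Browsers only honor the header when it arrives over HTTPS, so it belongs on the encrypted side of your site.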

Tips:

  • Support only TLS 1.0 or newer (TLS 1.1 or newer if you can)
  • Turn on TLS 1.1 and 1.2 and configure forward secrecy
  • Enable HSTS to force browsers to use HTTPS

Open Data from OpenTreeMap: Visualizing temporal data with CartoDB’s Torque

This entry is part 2 of 2 in the series Visualizing Open Data from OpenTreeMap with CartoDB

I just wrote up a meaty Labs post on my idea to visualize tree, species, and user edits over time within exported data from PhillyTreeMap.org, and already covered all the joining, formatting, converting, and uploading necessary to get to this point, along with some simple visualizations at the end. If you haven’t read it, go ahead. I’ll wait here. Because with this post I’m diving straight in to the temporal visualization features of CartoDB’s Torque.

Briefly, though, to reiterate: What are my goals for visualizing the 2 years of PhillyTreeMap user edits over time? I wanted to create something parallel to Mark Headd’s homicide data visualization (also done with Torque) but that told a story over time that was more uplifting. (What’s more uplifting than trees?) I also hoped my visualization would give us a rough idea of what neighborhoods and areas around Philadelphia have the most active PhillyTreeMap user edits, as well as what times of year seem most active. One could use that knowledge to determine and plan where or when to do outreach about PhillyTreeMap or the programs of our partners, like PHS Tree Tenders. What neighborhoods don’t have many user edits? When does participation drop off? On the flip side, where and when are urban forestry efforts succeeding in engaging the community? A time-based spatial visualization can help us answer those questions – and look really cool in the process!

One final caveat: it’s important to note that Torque is under very active development at CartoDB. I was looking for pointers as I was writing this blog, and the folks at CartoDB including Andrew Hill were very helpful on the mailing list and would be happy to answer other questions you have. But they told me the next generation version of the library is due to come out “soon”, with better documentation, and it may differ greatly from what I write about below. The visual effect of time based data in Torque is just so cool though, that I couldn’t wait!

Testing and Tweaking

CartoDB have set up a number of Torque demos right on Github Pages. You can look at their demo data, or plug in your CartoDB user, table, and column name into the options sidebar to visualize your own date-based data. My “plots_and_trees” table (that I created in the last blog) is set to public (yours must be as well if you want to use Torque, as currently the library doesn’t do password authentication), so feel free to use it if you wish: User “andrewbt”, table “plots_and_trees”, columns “tree_last_updated” or “plot_last_updated”.

The Torque demo gives you a number of options that affect the visual effect of your visualization, but because it’s in development I couldn’t find any good explanations of what they are or how best to use them. So I wrote some up. Congratulations, dear reader, you get to enjoy the fruits of my copious amounts of experimentation.

  • Resolution: Here you have a choice of doubling numbers from 1 to 32. What does it do? This effectively changes the granularity of the data points CartoDB will stream from your table to Torque. Or, as Andrew Hill explained on the Google Group, “resolution relates to the actual X, Y dimensions that data will collapse to coming from the server and drawing to pixels.” Point is, I noticed lower values seemed like they would more accurately reflect the location of the actual record, whereas larger values created a larger data point “dot” that gave a looser indication of actual location. It may be the case that for very large datasets, a larger resolution would make the animation faster or smoother. However, a resolution of 1 or 2 is fine for our PhillyTreeMap table. (more…)
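One way to picture the "collapsing" that the resolution option controls is to snap each point to a grid whose cell size equals the resolution. The sketch below is only a mental model under that assumption, not Torque's actual code, and the pixel coordinates are invented.

```python
from collections import Counter


def collapse_to_grid(points, resolution):
    """Snap (x, y) pixel positions to a grid and count points per cell."""
    cells = Counter()
    for x, y in points:
        cell = (x // resolution * resolution, y // resolution * resolution)
        cells[cell] += 1
    return cells


# Invented pixel positions for four records.
points = [(3, 5), (4, 6), (17, 2), (30, 31)]
fine = collapse_to_grid(points, 1)
coarse = collapse_to_grid(points, 16)
print(len(fine), "cells at resolution 1;", len(coarse), "cells at resolution 16")
```

At resolution 1 every record keeps its own cell, while at 16 nearby records merge into one larger dot, which matches the looser sense of location described above.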

Open Data from OpenTreeMap: Visualizing tree data with CartoDB

This entry is part 1 of 2 in the series Visualizing Open Data from OpenTreeMap with CartoDB

Update 12:30pm, 8-16-2013: CartoDB is working on a fix for the WKT issues I stumbled upon in this blog and tweeted a workaround. Thanks Javier!

Many months ago, after the City of Philadelphia released some of its Part 1 Crime Incident data on OpenDataPhilly, I read a blog post by our very own Chief Data Officer Mark Headd where he visualized 6 years of homicides in the City of Brotherly Love on a temporal map using CartoDB’s Torque library. While the story the map tells is an important one, it is also depressing and sad – every second, as you watch, more dots appear on your screen representing way too many homicides in our city.

I was talking with a friend outside Azavea about Headd’s visualization, and posed a question: “What positive, uplifting change over time in our city could we tell the story of?” I sometimes get the feeling that so much data and visualizations of it are negative or otherwise shock us: from our struggling education system, to stolen bikes, to the disparate impact of voter ID laws. While visualizations like these uncover important stories to tell, so much sad news (for me at least) can sap my motivation to help fix it all. We need to visualize the good and give praise for what’s working, as much as we should analyze the bad and criticize what still needs to be done.

Hearing my frustration, my friend asked, “What about tree plantings or something?”, I assume without even realizing the connection she had just made in my mind.

Of course! That’s it! I happen to work for Azavea, where we craft OpenTreeMap, the best open source public tree inventory software around! I knew I could easily export data from PhillyTreeMap.org for almost two full years’ worth of ongoing, crowdsourced tree inventory and edits to the map in Philadelphia. We know that having more green, leafy trees and nature around makes people happier psychologically, increases property values, cleans our air and water, and saves electricity and our environment. This was going to be a fun project.

Open, really open

Usually I think of the “Open” in “OpenTreeMap” as referring to the fact that it’s open source software. But there’s no reason that word “open” can’t be referring to open data as well. When someone adds a tree or tree details to an OpenTreeMap site, they are creating new data. We have always been big proponents of open data at Azavea, having originally built the OpenDataPhilly catalog ourselves. We know data can be reused and analyzed in more ways than we ourselves can imagine or hope to build into OpenTreeMap itself. (Remember Amos Almy’s project?) So OpenTreeMap follows this philosophy, and doesn’t lock its data away once users collect it. Right next to the main map of every OpenTreeMap site are three little links: “Export this search: KML | CSV | Shapefile”.

Each of those links allows anyone to download the results of a search for a specific species of tree, the trees in a particular neighborhood, or even every tree and plot in the system, in three different widely-used geospatial data file formats. From there, you can use desktop analysis tools like ArcMap, QGIS, or Google Refine and Earth, or cloud services like Google Fusion Tables and CartoDB to filter, query, and visualize your collected tree data.

So, I went to PhillyTreeMap.org and waited eagerly as I downloaded a zipped file containing three CSVs with details on the 183,758 plots, 56,310 trees, and 300 species on the map. More on why I picked CSV versus KML or Shapefile in a moment…

(more…)

Geocoding with Cicero in Google Docs: An Open Source Collaboration

A Google Spreadsheet with addresses geocoded and stamped with their US House districts.


Several days ago as our Data Analytics and Marketing teams were planning the Azavea “Lunch and Learn” workshop series (our final Wednesday workshop is still open! Register here.), my colleague Jeremy Heffner discovered a script written originally by Dave Cole and Tom MacWright at MapBox: geo-googledocs.

Geo for Google Docs is a small Javascript add-on that adds address-based geocoding and GeoJSON export capabilities to Google Spreadsheets. So, if you have a long list of street addresses you need to geocode, you can easily throw them in a Google Spreadsheet and simply copy/paste this script via the Google Docs Script Editor. After only a few button clicks, latitude and longitude columns are added to your spreadsheet and populated with coordinates for each of your address records. How easy is that!?

Jeremy and Sarah Cordivano realized this would be a great tool to demo the principle of geocoding to attendees at Sarah’s “From Databases to Maps” Lunch and Learn workshop. Google Docs/Drive is free for personal use and nonprofit organizations and also used by many businesses, so tons of people are familiar with the software already. Many small organizations already use spreadsheets to collect data on the constituents they serve, and even if they’ve already invested in a more robust Contact or Constituent Relationship Management system (CRM) like Salesforce or CiviCRM, it’s easy to move records back and forth between a CRM and spreadsheets.

Our Cicero API supports address-based geocoding too, so Jeremy spent some time integrating Cicero as a geocoding provider into the MapBox geo-googledocs script, which already supported the Yahoo PlaceFinder and Mapquest Nominatim APIs.

We were all excited to see Tom from MapBox accept and merge Jeremy’s pull request in time for Sarah to show off the script and geocoding using the Cicero API at the July 17th Lunch and Learn workshop. She also wrote this handy PDF tutorial for installing the script and then publishing the results in a CartoDB web map.

While geocoding is cool, what really makes the Cicero API powerful is the wealth of data we have available for matching addresses to local, state, and national legislative districts, elected officials, census geographies, watersheds, schools, and more. We frequently get requests from advocacy groups and other clients to batch process a database of constituent addresses and “stamp” each record with district information. This got me thinking – it was easy enough for Jeremy to add latitude and longitude fields from Cicero to the script, so what if customers could use it to batch process their own spreadsheets and include other data fields from Cicero like congressional districts or elected official contact info?

So, I made my own fork of the geo-googledocs script on Github and spent some time working on exactly that. Specifically, I added a “new” geocoder to the list, “cicero_ushouse”.

When the cicero_ushouse geocoder is used, the script makes a query to Cicero’s /legislative_district endpoint (much like Jeremy’s original modification) but filters on district_type=NATIONAL_LOWER, which for addresses in the USA will return their US House of Representatives district. I also added code in other areas of the script to pull the district’s label, state, and alphanumeric ID, and populate three additional columns in the spreadsheet with those fields. MapBox’s code was flexible enough that these additional columns are added as attributes to each of the address points when exporting the data to GeoJSON – I didn’t need to add any extra code! Having political districts as attributes for your addresses can enable you to make choropleth maps or other analysis of where your addresses lie politically. My fork is only a proof of concept at this point, and I plan to keep adding to it. Even so, it already demonstrates that with just a few extra lines of Javascript, anyone can customize geo-googledocs to pull whatever fields they’d like from Cicero’s geocoding endpoints.
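The query and the three stamped columns can be sketched as follows, in Python rather than the script's Javascript. The base URL, the "user" and "token" parameter names, the field names in the district object, and the sample district are all assumptions for illustration; house_district_url and district_columns are hypothetical helpers, not functions from geo-googledocs.

```python
from urllib.parse import urlencode

# Assumed base URL, for illustration only.
BASE_URL = "https://cicero.azavea.com/v3.1"


def house_district_url(address, user_id, token):
    """Build a /legislative_district query filtered to the US House."""
    params = {
        "search_loc": address,
        "district_type": "NATIONAL_LOWER",
        "user": user_id,
        "token": token,
    }
    return BASE_URL + "/legislative_district?" + urlencode(params)


def district_columns(district):
    """Pick the three values stamped onto each spreadsheet row."""
    # Assumed field names in the API's district object.
    return [district["label"], district["state"], district["district_id"]]


# Invented district record standing in for one item of a real response.
sample_district = {"label": "Example District 1", "state": "PA", "district_id": "01"}

print(house_district_url("123 Main St", 42, "abc123"))
print(district_columns(sample_district))
```

Swapping NATIONAL_LOWER for another district type, or picking different fields in district_columns, is the kind of small customization the fork is meant to demonstrate.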

geo-googledocs is a tiny, one-file repository that accomplishes a fairly basic (though crucial) task, but it embodies a lot of the power I see in open source software. The MapBox developers originally wrote it a year ago. Things could have ended there. But thanks to Github, we were able to discover, fork, improve, and contribute to the original project. Even if it is just one file, it’s pretty cool for informal collaboration like this to happen so seamlessly!