Azavea Labs

Where software engineering meets GIS.

python-cicero: A Python Packaging Tutorial

This entry is part 2 of 2 in the series python-cicero and Python Packaging

As mentioned in my last Labs post, last month I released python-cicero, a Python wrapper library for Azavea’s Cicero API. You might recall me mentioning in that post that the Python packaging process is a bit of a mess. You might also remember that the talk I gave about the project at the GeoPhilly Meetup drew praise from some attendees for conveying the man-versus-machine conflict inherent in the packaging journey. If you took a look at those slides, you may have noticed pictures of rubber ducks and horses holding cats, which I shamelessly stole from another fantastic but exasperated-feeling programming talk that accurately captured my sentiments towards Python packages at the time:

Wat – wæt: n. The only proper response to something that makes absolutely no sense.

We wrote our Python wrapper. We have docs. Even unit tests. One would think we’d be past most of the hurdles that stand between us and shipped Python code, but there is one final harder-than-it-should-be section of the journey to overcome: How do we make it so other people can install and use our wrapper?

The answer is to turn our wrapper into a Python package, and upload it to the Python Package Index. Why is this hard? One issue is the lack of clear, authoritative documentation on the process – part of the reason I’m writing this post. So it should come as no surprise our first obstacle is one of vocabulary.

Modules, Packages, and Distributions

At first, our wrapper is just a Python “module” – some .py source files. A module can be a single .py file, or a “multi-file module” like ours: several .py files in a directory (here, “cicero”) with a special __init__.py file that tells Python to treat the whole directory as one module. As a multi-file module, users will be able to import everything necessary for the wrapper with a simple “from cicero import *”, or even “from cicero import CiceroRestConnection”.

To make it so others can easily download and install it, with either Python’s “easy_install” command or the far superior “pip”, we have to make our module a proper Python “package” and upload a version of that package (called a “distribution” file) to the Python Package Index.

Those three terms bear repeating. A module is one or more Python source files. A package is one or more modules, plus some supporting files which we’ll get into below. A distribution is an archive file (think tarball) of a package that is uploaded to PyPI – it’s what your users will actually download from the internet and install with easy_install or pip.

Having gone through this process, I believe the Python community does not take sufficient care to distinguish among these three terms when discussing packaging. Often, Pythonistas will refer to pretty much everything as a “package”. This results in unnecessary confusion and contradiction for newcomers as they try to understand the already messy packaging process. “pip” stands for Pip Installs Packages, when really it’s often downloading and installing distribution files. The Python Package Index is not called the Python Distribution Index, when it probably should be. Folks will refer to a directory of Python files as a package, when they probably really mean a multi-file module.

The Packaging Process

With our terminology settled, what are the “supporting files” I mentioned that go into a package? I’m glad you asked! Here’s a list of the key ones:

  • The modules to be packaged
  • A changelog – CHANGES.txt is the convention
  • A license if the package is open source – LICENSE.txt is the convention
  • A readme file written in reStructuredText, and that’s more than just a convention (see below)
  • A MANIFEST.in file
  • A setup.py file
  • Other non-essential but related files: documentation, example scripts, tests, etc

I’ll assume you know about changelogs, licenses, and readme files – if not, they’re easy to find out about, and no specific formatting is required for your package; having them is just A Good Idea™. However, the reason you should write your readme file in reStructuredText if you can is that it will form the basis of your project’s page on PyPI. PyPI will automatically read and format reStructuredText with headings and all that good jazz. You can write your readme file in Markdown or just plain text, but it won’t look as nice.
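For example, the top of a README.rst might start like this (purely illustrative):

python-cicero
=============

A Python wrapper for Azavea's Cicero API, a web service for elected
official data, district-matching, and geocoding.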

Finally, we already have our modules, a “docs” folder of documentation files that Pycco generated, and a “cicero_examples.py” file. So let’s move on to the two files we haven’t encountered yet: MANIFEST.in and setup.py.

MANIFEST.in

Whichever Python packaging utility (more on that in a moment) you use to create your distribution file and submit your software to PyPI will include some files by default – the .py source files it can find, for one. Invariably, however, those will not be the only files you want to include as part of your package and/or distribution! Documentation, the changelog, and example files are all commonly overlooked by the packaging utilities, but they are in fact critical parts of your finished package and distribution. The MANIFEST.in file’s job is to identify all these extra files to be included. Take python-cicero’s MANIFEST.in as an example:
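It looks something like this (a sketch reconstructed from the file list above – see the GitHub repo for the canonical version):

include README.rst
include LICENSE.txt
include CHANGES.txt
include cicero_examples.py
recursive-include docs *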

You can just put all the files you want to include in your package/distribution in this file, with a preceding “include” statement. If you have a whole directory you want to include, save yourself some typing and use a “recursive-include” statement and asterisk to include all that directory’s files, like I do above for “docs”.

setup.py

This is the real glue that finally puts your package together. It’s actually a short Python program that is run when you first register your package on PyPI, again when you build a distribution file, and finally when you upload that distribution to PyPI. It’s usually pretty simple, with just an import statement to bring in your packaging utility and a call to that packaging utility’s setup() function, with many metadata parameters passed to that function:
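Here’s a condensed sketch of what python-cicero’s setup.py looks like (the metadata values below are illustrative placeholders rather than the package’s actual metadata – see the GitHub repo for the real file):

from setuptools import setup, find_packages

setup(
    name='python-cicero',
    version='0.1.0',  # illustrative -- bump this as you release
    author='Your Name Here',
    author_email='you@example.com',
    url='https://github.com/azavea/python-cicero',
    description='A Python wrapper for Azavea\'s Cicero API.',
    long_description=open('README.rst').read(),  # reuse the RST readme
    packages=find_packages(),  # finds "cicero" and the "test" module under it
    extras_require={
        'docs': ['Pycco'],  # only needed if you want to rebuild the docs
    },
    classifiers=[
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'Topic :: Scientific/Engineering :: GIS',
    ],
)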

Sidebar: what’s this “packaging utility” I keep referring to? I used a utility called “setuptools.” If you just want to get up and running, I recommend you use setuptools as well. If you’re using pip and virtualenv, you surely already have it in your virtualenv. Unless you have strange edge cases, it will probably work for packaging your package, too. But there are other alternative packaging utilities out there with different edge cases and compatibilities, and this is one of the reasons Python packaging is so confusing. If you see references to other utilities by the names of distutils, distribute, distutils2, or even “bento” – don’t fret. They all accomplish roughly the same thing as setuptools. The first and second answers to this stackoverflow post give a great overview of what these other utilities are, the open source community minutiae behind why they exist, and even why some of them are merging back together. Again, no need to stress over it – just go with setuptools for now if you can.

Back to setup.py: There are only two setup() parameters that are really essential: “name” and “packages”. “name” tells setuptools what the name of your package is, and “packages” tells setuptools what packages (really, multi-file modules and modules – again with Python’s terminology inconsistency!) are included in the package you’re creating. If you don’t have many, you can just list them. If you have a lot, or want a shortcut, you can import and use setuptools’ find_packages() function like I did, which recursively searches the directories under setup.py for all Python multi-file modules. In my case, it found both my “cicero” module and my “test” module under it.

All the other parameters I used, while not essential, are really useful both for your PyPI listing and for your users. Let’s go over a few:

  • version - As you fix bugs and add new features, you’ll likely upload and release new versions of your package. So give it a version number!
  • author and maintainer and email fields – You wrote it, give yourself credit! And if you’d like, give your email so your users can contact you with questions.
  • url - your project’s PyPI page is likely not the only or even the best location for information about your package. Put any extra URLs you have here.
  • description and long_description - Your PyPI listing will be built from these. You can use Python to open and read your README file directly – again, if you wrote it in RST format, your PyPI page will be nicely formatted.
  • extras_require and/or install_requires - Use these if your project has other Python packages as dependencies. In the case of python-cicero, the wrapper itself is implemented entirely with the standard library, so nothing else is required. But if someone anticipates wanting to edit the documentation, they should install Pycco too. And this is what our extras_require entry would allow them to do:
    $ pip install "python-cicero[docs]"

    If you anticipate your users using pip to install your package, then you might also want a requirements.txt file. More information on handling requirements is available here and here.

  • classifiers - PyPI has an extensive list of classifiers for package listings. These are sort of like tags, and will help people find your project and understand a bit about it. Pick a few like a development status, license, and topic from this list exactly as they appear.

The list of options that can go into setup.py is quite extensive; look at the official docs for more, but the above is certainly enough to get you started.

Submission to PyPI

We’ve made it to our last step! Our package and all its files are written, and we’re ready to register the project with PyPI and upload a distribution for others.

First, make accounts at both the test PyPI and the real PyPI. Especially for your first time, you’ll want to try this process out on the test site – it gets cleaned out and reset every so often, so there’s no risk if you mess up. You’ll also want to make sure you’ve given your package a name that is not already taken on the real PyPI before you try to upload there. Once you take a name on the live PyPI, you’ve taken it away from every other user forever.

Next, create a ~/.pypirc file in your home directory (Windows users – you’ll need to set a HOME environment variable to point to the location of this file):

[distutils]
index-servers =
    test
    pypi

[test]
repository: https://testpypi.python.org/pypi
username: your_pypitest_username
password: your_pypitest_password

[pypi]
repository: https://pypi.python.org/pypi
username: your_pypi_username
password: your_pypi_password

With your login info saved in .pypirc, we have a few simple commands left:

$ python setup.py register -r test

The above should have registered your project with the test PyPI and created a page for it. See if you can get there by going to https://testpypi.python.org/pypi/name_of_your_package. If it worked, now you can build a source distribution file (sdist) and upload it to the test PyPI:

$ python setup.py sdist upload -r test

Look at your package’s test page – is there a tar.gz file listed near the end to download? Great! Now we can do the same process for real:

$ python setup.py register -r pypi
$ python setup.py sdist upload -r pypi

And we’re finally done. Your users should now be able to install your package easily with:

$ pip install your_package
$ #OR
$ easy_install your_package

Overview

Congratulations, you’ve just released some Python software! Now you know about:

  • The differences between a Python module, multi-file module, package, and distribution, and how they’re frequently confused
  • The Python Package Index
  • Creating key files like MANIFEST.in and setup.py which, in addition to Python modules, make up your Python package
  • The steps needed to upload and submit your package to both the PyPI test and PyPI Live instances

If you’re lost or curious, I found these resources incredibly helpful when going through this process for the first time:

Additionally, you can look to the packages Azaveans have contributed to PyPI as examples – django-queryset-csv, python-cicero, and python-omgeo. By all means, pip install them and try them out!

python-cicero: A New Wrapper for the Cicero API

This entry is part 1 of 2 in the series python-cicero and Python Packaging


Last month, I was proud to release our first official language-specific “wrapper” for Cicero, our API for elected official data, district-matching, and geocoding. “python-cicero,” as it’s called, is now available to all on Github or the Python Package Index (also known as PyPI). January also happened to be when the brand new GeoPhilly Meetup group was having its first joint meeting with the Philly Python User’s Group, and I was excited to have such a perfect nexus event, with both Python and GIS nerds in the audience, at which to give a talk about this project. In the words of one of our attendees, John Ashmead (who also has some background in science fiction writing), I did a good job in my talk of conveying the struggle and conflict between “man and machine” inherent in the process of releasing a Python package.

Yes, it’s sad but true: a certain dose of “man vs machine” conflict is inherent because the state of Python packaging is a total mess and has been for a long time. All newcomers, like myself or my colleague Steve Lamb (with his recently packaged django-queryset-csv project), soon discover this when they embark on distributing their first package, and even CPython core contributors admit it without hesitation. The crooked, winding, poorly documented road to a finished Python package is even more mind boggling when you consider that there are nearly 40,000 of these packages on PyPI. This is not a rare, obscure process. Python packages seem easy at face value.

The packaging process is a lot to cover though, so I’ll be writing a separate tutorial on that and my findings in an upcoming Azavea Labs post later this week. Stay tuned!

Designing a Wrapper

For this post, we’ll examine the wrapper itself, along with another face-value assumption: that API wrappers are “small potatoes” projects. Searching Google or Github for “api wrapper” will give you an idea of how common these things are – and frequently the same API will have duplicate wrappers written in the same language by different authors. And sure, when compared to large software projects like Azavea’s recent Coastal Resilience mapping application, or our veteran Homelessness Analytics visualization site, the 300 KB python-cicero library is tiny.

However, within the relatively small charge of a library intended to make HTTP requests to an API easier, there is a deceptively large number of design considerations to take into account. Netherland points out a few of these in the previous link, particularly around “wrapping” versus “abstraction.” As when designing any software, especially software intended to be used by others at a technical level, you have to think about how your users will use your tool and anticipate their needs and desires. Who uses your API? What for? Are your users technical enough that your wrapper is just saving them repeated calls to “urllib2.urlopen()”? Or would they appreciate some guidance and hand-holding in the form of additional abstraction? The answers to those questions inform the interface you design for your wrapper library. Not the most monumental task, but not the smallest either.

Some of our Cicero API users are very technical, and dive straight into the API. But often, our Cicero API clients come to us from smaller, nonprofit political advocacy groups. Sometimes the people who sign up for Cicero accounts at these organizations have a limited technical background – web development skills they’ve picked up on the side for specific projects here and there. It was this type of user that was in my mind as I designed python-cicero, and why I decided to lean towards more abstraction.

First Contact

Cicero is a paid service, so we’ve implemented a system of authentication to verify that users querying the API have an account in good standing. Users send us their account username and password in the payload of a POST request, and we return an authentication token and numeric user ID that they place in the query string of the URLs for their subsequent calls to the API (which, incidentally, are all GET requests).

In the wrapper, I decided to abstract all of that. We have a class, “CiceroRestConnection”, which is instantiated with a username and password. That’s it! You are now ready to make all your API calls with this new class instance without ever having thought about tokens or POST requests or anything beyond remembering your login details.
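For example (the credentials here are placeholders):

from cicero import CiceroRestConnection

cicero = CiceroRestConnection("your_username", "your_password")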

Under the hood, the __init__ method of the CiceroRestConnection class takes the username and password, encodes them into a payload, makes the request to Cicero’s /token/new.json endpoint, parses the token and user ID out of the successful response, and assigns these to class attributes so they’re available to the other class methods that access the other API endpoints. Tokens expire roughly every 24 hours, after which Cicero responds to calls made with the expired token with 401 Unauthorized. If necessary, users can build logic into their Python applications to check for this response and, if it’s received, call __init__ again to refresh the token or simply re-instantiate the class.
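As an illustration, here’s one way an application could wrap a call with that retry logic. (How the 401 surfaces – as a urllib2.HTTPError below – is an assumption for the sake of the sketch, not a documented behavior of the wrapper.)

import urllib2

def get_official_with_retry(cicero, username, password, **kwargs):
    try:
        return cicero.get_official(**kwargs)
    except urllib2.HTTPError as error:
        if error.code == 401:  # token expired
            cicero.__init__(username, password)  # fetch a fresh token
            return cicero.get_official(**kwargs)
        raise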

Getting Data

Taking our example “cicero” instance from before, we can make a request to the API’s /official endpoint. All endpoints in Cicero aside from requesting new tokens are HTTP GET requests, so I adopted this as my naming scheme for CiceroRestConnection class methods (“get_official()”, “get_nonlegislative_district()”, “get_election_event()”, etc). The user passes however many keyword arguments (all identical to those described in the Cicero API docs) they need to execute their query to the endpoint they’ve chosen (in this case, we kept it simple with one “search_loc” argument to geocode Azavea’s HQ address). The wrapper makes the request, and parses the response into another set of classes that can be easily navigated with Python’s dot notation, all with proper error handling. The user doesn’t have to fiddle with JSON, Python dictionaries, or anything.
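In code, that looks something like this (the attribute path into the response object is illustrative – consult the wrapper docs for the actual response class structure):

response = cicero.get_official(search_loc="340 N 12th St, Philadelphia, PA")

# Navigate the parsed response with dot notation instead of raw JSON.
first_official = response.response.results.candidates[0].officials[0]
print first_official.last_name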

Getting a specific official, district, or election event by its unique ID – in proper ReST fashion – requires placing that numeric ID directly in the root URL rather than in the query string as another keyword argument – i.e., /official/123, not /official?id=123. This makes sense to someone familiar with ReST – you’re requesting a specific resource, and that should be part of the Uniform Resource Locator – but it has easily tripped up beginners who expect ID to be just another query string parameter. python-cicero resolves this by having all queries be composed of keyword arguments passed to any of our wrapper methods, including ID. We check for its presence and construct the URL appropriately without burdening the user:
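Here’s a simplified sketch of that logic (not the wrapper’s literal code – the base URL and helper name are made up for illustration):

import urllib

API_BASE = "https://cicero.azavea.com/v3.1"  # illustrative base URL

def build_url(endpoint, **kwargs):
    # An "id" kwarg goes into the URL path, ReST-style; all remaining
    # kwargs become query string parameters.
    unique_id = kwargs.pop("id", None)
    url = "%s/%s" % (API_BASE, endpoint)
    if unique_id is not None:
        url = "%s/%s" % (url, unique_id)  # e.g. /official/123
    if kwargs:
        url = "%s?%s" % (url, urllib.urlencode(kwargs))
    return url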

Documentation Is Important

A key part of all developer-focused software is having good documentation. You won’t be around to explain how to use it to everyone, so you’d better write that down and write it down clearly. A stalwart in the Python world is the Sphinx system for generating docs. It’s a great tool, but I feel it’s a bit bloated for smaller projects. Also, I don’t like writing in reStructuredText as Sphinx requires and find Markdown to be a bit more intuitive. Furthermore, I personally really appreciate being able to see code alongside my docs, following along in each.

So I was very happy to find a lightweight alternative Python documentation generator, Pycco – a Python port of the Javascript-focused Docco. Pycco lets you write docs as Python docstrings formatted in Markdown:
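For example, a method’s docstring might look something like this (a made-up stub, just to show the style):

def get_official(self, **kwargs):
    """
    Query the Cicero API's `/official` endpoint.

    Accepts the same **keyword arguments** described in the Cicero API
    docs, such as `search_loc` or `id`, and returns a parsed response
    object.
    """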

Then, run Pycco against your Python source files with one command:

$ pycco cicero/cicero_response_classes.py

And beautiful, fancy-font, syntax-highlighted HTML documentation pages pop out – code on one side, docs on the other. Easy!

Try it Out

If you’d like to give Cicero a try, python-cicero is now one of the easiest ways to do it. Either use Python’s “easy_install” utility or the (superior, if you have it) pip to install the wrapper:

$ easy_install python-cicero
$ #OR
$ pip install python-cicero

Take a look at the docs, available at http://azavea.github.io/python-cicero/, to get a sense of the methods available to you, as well as the “cicero_examples.py” file in the package.

And again, keep an eye out for my upcoming Labs post – we’ll dive into the more-complex-than-necessary world of creating Python packages and submitting them to the Python Package Index, as I did with python-cicero, with a full tutorial! It should be ready to go this week.

Securing GIS applications with SSL and HSTS

In building the new version of our HunchLab product for crime forecasting we are very concerned about security.  Police departments work with data sets that not only require privacy but can also expose individuals to harm if the data is disclosed publicly.  As application developers we often focus on security within the data center.  While making sure that our application properly handles SQL injection attacks and maintains the proper firewalls is critical, the weakest security component is often the clients themselves.  Here are a few tips for improving your application’s security outside of the data center.

SSL is not so simple

When we browse the web, we know that when we see the lock icon in the browser tab our communication is encrypted and therefore “secure”.  But not all TLS/SSL is actually secure.  In researching the topic for our development work, I encountered quite a scary reality.  More than 20% of websites support a version of the SSL protocol that is completely broken (SSL v2).  So what to do?

TLS 1.0 is supported by nearly all browsers — a notable exception is Internet Explorer 6.  If you can live without IE6 support (and I pray that you can) then there is no reason to support SSL version 2 or 3 in your application.  You should also enable TLS 1.1 and 1.2 for browsers that support these newer standards.

If you are operating a highly sensitive application and can control which browsers you support, then disabling TLS 1.0 and selecting the right options can also improve security.  I found the cipher chart and browser support charts on Wikipedia to be immensely helpful.   For HunchLab 2.0 we decided to only support TLS 1.1 and 1.2 since we are dealing with sensitive data.

These newer versions of TLS give you the option of supporting ciphers with forward secrecy.  For instance, assume you have typical SSL settings turned on and the NSA logs all of the traffic to your website.  Perhaps they can’t decrypt the traffic today, but if at any time in the future they get a copy of your SSL certificate’s private key, then they can decrypt all of that historic traffic.  Forward secrecy prevents this issue by generating ephemeral keys for each connection.  Look for settings for ephemeral Diffie-Hellman (DHE) and elliptic curve Diffie-Hellman (ECDHE).
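For example, if you happen to terminate SSL with nginx (just one option among many), a configuration in the spirit of our HunchLab 2.0 choices might look like this sketch – tune the cipher list to your own browser support requirements:

ssl_protocols TLSv1.1 TLSv1.2;  # no SSLv2, SSLv3, or TLS 1.0
ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256:!aNULL:!MD5;
ssl_prefer_server_ciphers on;   # favor the forward-secret ciphers above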

A great way to learn about these settings and see how your application does is to use this SSL Tester.

HTTP Strict Transport Security

Another way to improve security is to tell the user’s web browser to only use encrypted connections.  This is really easy to do via the HTTP Strict Transport Security (HSTS) response header.  If you support SSL within your application for all URLs then there is no reason to wait to do this.  It’s a simple header.  Chrome, Firefox, and Opera currently support this standard.
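The header itself is a single line (max-age is in seconds; a year, as shown here, is a common choice):

Strict-Transport-Security: max-age=31536000; includeSubDomains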

Tips:

  • Support only TLS 1.0 or newer (TLS 1.1 or newer if you can)
  • Turn on TLS 1.1 and 1.2 and configure forward secrecy
  • Enable HSTS to force browsers to use HTTPS

Open Data from OpenTreeMap: Visualizing temporal data with CartoDB’s Torque

This entry is part 2 of 2 in the series Visualizing Open Data from OpenTreeMap with CartoDB

I just wrote up a meaty Labs post on my idea to visualize tree, species, and user edits over time within exported data from PhillyTreeMap.org, and already covered all the joining, formatting, converting, and uploading necessary to get to this point, along with some simple visualizations at the end. If you haven’t read it, go ahead. I’ll wait here. Because with this post, I’m diving straight into the temporal visualization features of CartoDB’s Torque.

Briefly, though, to reiterate: what are my goals for visualizing the two years of PhillyTreeMap user edits over time? I wanted to create something parallel to Mark Headd’s homicide data visualization (also done with Torque) but that told a story over time that was more uplifting. (What’s more uplifting than trees?) I also hoped my visualization would give us a rough idea of which neighborhoods and areas around Philadelphia have the most active PhillyTreeMap user edits, as well as what times of year seem most active. One could use that knowledge to plan where or when to do outreach about PhillyTreeMap or the programs of our partners, like PHS Tree Tenders. What neighborhoods don’t have many user edits? When does participation drop off? On the flip side, where and when are urban forestry efforts succeeding in engaging the community? A time-based spatial visualization can help us answer those questions – and look really cool in the process!

One final caveat: it’s important to note that Torque is under very active development at CartoDB. I was looking for pointers as I was writing this blog, and the folks at CartoDB including Andrew Hill were very helpful on the mailing list and would be happy to answer other questions you have. But they told me the next generation version of the library is due to come out “soon”, with better documentation, and it may differ greatly from what I write about below. The visual effect of time based data in Torque is just so cool though, that I couldn’t wait!

Testing and Tweaking

CartoDB has set up a number of Torque demos right on Github Pages. You can look at their demo data, or plug your own CartoDB user, table, and column name into the options sidebar to visualize your own date-based data. My “plots_and_trees” table (which I created in the last blog) is set to public (yours must be as well if you want to use Torque, as the library doesn’t currently do password authentication), so feel free to use it if you wish: user “andrewbt”, table “plots_and_trees”, columns “tree_last_updated” or “plot_last_updated”.

The Torque demo gives you a number of options that affect the visual effect of your visualization, but because it’s in development I couldn’t find any good explanations of what they are or how best to use them. So I wrote some up. Congratulations, dear reader, you get to enjoy the fruits of my copious amounts of experimentation.

  • Resolution: Here you have a choice of doublings from 1 to 32 (1, 2, 4, 8, 16, 32). What does it do? This effectively changes the granularity of the data points CartoDB will stream from your table to Torque. Or, as Andrew Hill explained on the Google Group, “resolution relates to the actual X, Y dimensions that data will collapse to coming from the server and drawing to pixels.” Point is, I noticed lower values seemed to more accurately reflect the location of the actual record, whereas larger values created a larger data point “dot” that gave a looser indication of actual location. It may be the case that for very large datasets, a larger resolution would make the animation faster or smoother. However, a resolution of 1 or 2 is fine for our PhillyTreeMap table.

Open Data from OpenTreeMap: Visualizing tree data with CartoDB

This entry is part 1 of 2 in the series Visualizing Open Data from OpenTreeMap with CartoDB

Update 12:30pm, 8-16-2013: CartoDB is working on a fix for the WKT issues I stumbled upon in this blog and tweeted a workaround. Thanks Javier!

Many months ago, after the City of Philadelphia released some of its Part 1 Crime Incident data on OpenDataPhilly, I read a blog post by our very own Chief Data Officer Mark Headd where he visualized 6 years of homicides in the City of Brotherly Love on a temporal map using CartoDB’s Torque library. While the story the map tells is an important one, it is also depressing and sad – every second, as you watch, more dots appear on your screen representing way too many homicides in our city.

I was talking with a friend outside Azavea about Headd’s visualization, and posed a question: “What positive, uplifting change over time in our city could we tell the story of?” I sometimes get the feeling that so much data, and so many visualizations of it, are negative or otherwise shocking: from our struggling education system, to stolen bikes, to the disparate impact of voter ID laws. While visualizations like these uncover important stories to tell, so much sad news (for me at least) can sap my motivation to help fix it all. We need to visualize the good and give praise for what’s working, as much as we analyze the bad and criticize what still needs to be done.

Hearing my frustration, my friend asked, “What about tree plantings or something?”, I assume without even realizing the connection she had just made in my mind.

Of course! That’s it! I happen to work for Azavea, where we craft OpenTreeMap, the best open source public tree inventory software around! I knew I could easily export data from PhillyTreeMap.org covering almost two full years of ongoing, crowdsourced tree inventory and edits to the map in Philadelphia. We know that having more green, leafy trees and nature around makes people happier psychologically, increases property values, cleans our air and water, and saves electricity and our environment. This was going to be a fun project.

Open, really open

Usually I think of the “Open” in “OpenTreeMap” as referring to the fact that it’s open source software. But there’s no reason that word “open” can’t be referring to open data as well. When someone adds a tree or tree details to an OpenTreeMap site, they are creating new data. We have always been big proponents of open data at Azavea, having originally built the OpenDataPhilly catalog ourselves. We know data can be reused and analyzed in more ways than we ourselves can imagine or hope to build into OpenTreeMap itself. (Remember Amos Almy’s project?) So OpenTreeMap follows this philosophy, and doesn’t lock its data away once users collect it. Right next to the main map of every OpenTreeMap site are three little links: “Export this search: KML | CSV | Shapefile”.

Each of those links allows anyone to download the results of a search for a specific species of tree, the trees in a particular neighborhood, or even every tree and plot in the system, in three different widely-used geospatial data file formats. From there, you can use desktop analysis tools like ArcMap, QGIS, or Google Refine and Earth, or cloud services like Google Fusion Tables and CartoDB to filter, query, and visualize your collected tree data.

So, I went to PhillyTreeMap.org and waited eagerly as I downloaded a zipped file containing three CSVs with details on the 183,758 plots, 56,310 trees, and 300 species on the map. More on why I picked CSV versus KML or Shapefile in a moment…
