How to Manipulate Big Data for Use in ArcMap

Now in its second year, Azavea’s Summer of Maps Program has become an important resource for non-profits and student GIS analysts alike. Non-profits receive pro bono spatial analysis work that can enhance their business decision-making processes and programmatic activities, while students benefit from Azavea mentors’ experience and expertise. This year, three fellows worked on projects for six organizations that spanned a variety of topics and geographic regions. This blog series documents some of their accomplishments and challenges during their fellowship. Our 2013 sponsors, Esri and Tri-Co Digital Humanities helped make this program possible. For more information about the program, please fill out the form on the Summer of Maps page.

How to Manipulate Big Data for Use in ArcMap

I worked on a project this summer with Sunlight Foundation to map and analyze federal political campaign contributions from individuals from 1990-2012 – twelve 2-year cycles of data – on the county level. This translated, in terms of data, to 25.7 million records in a single database. Given the immensity of the project’s scope and, therefore, its data, I needed to use different methods of working with this large dataset than I normally would have with a few thousand records.

The first thing I wanted to do with the data was geocode the contributions by the contributor’s address. Of course, that meant that the contributions needed to have addresses associated with them, be it a zip code or a complete street address. However, many of the records either did not have address information or came from an international donor, which meant I needed to clean the dataset before I could geocode it. Yet Excel, the software I would ordinarily use to clean data, could not even open just one cycle, let alone all 12 together. Therefore, I decided to use the open-source program PG Admin, a PostGres administration program, to manage the dataset and prep it for geocoding.

A colleague from Sunlight uploaded the 12-cycle file to a PostGres database so I could use it in PG Admin, which eliminated any potential problems we might have had with getting the data from their office in DC to ours in Philadelphia. I then used SQL queries to create a new table for each 2-year cycle and eliminate contributions without address or with an international address. This left me with 21.9 million geocodable contributions. I then exported the 12 tables to geodatabase tables for use in ArcMap.

I tried geocoding the data first using the basic locators that come built-in to ArcGIS. For 2012 alone – the cycle with the most contributions at 3.9 million – the geocoding would have taken 3.5 days. I instead used the NAVTEQ-based version of StreetMap P remium for ArcGIS, which shortened the time to 15 hours for the 2012 cycle. That seemed like a good deal in comparison.

After the geocoding was done, my next step was to match each contribution with the county it falls within and sum the contribution data by county for both the number of contributions and the monetary sums of various subsets of the data so that I could analyze it. I decided to join polygons to points, aggregate all the subsets of data by county, and then join the data to a county shapefile.

I chose to join polygons to points instead of joining points to polygons because, while it is a longer process, it would result in cleaner datasets and stamp each contribution with the FIPS code of the boundary it falls within. Plus, Sunlight wanted to know which county, congressional district and census tract each contribution fell within, even though I was only doing my analysis on the county level. With a points to polygons join, I could provide them the contribution datasets with FIPS codes for all three legislative boundaries.

After stamping the data (a process that took several hours for each cycle), my next step was to summarize the contribution data by county. First, I created Boolean fields in the stamped contribution files for each subset of data I wanted to summarize. The summarizing process itself can be done with the Summary Statistics tool for summarizing an entire field, such as the number of contributions or the total contribution dollars. But for my analysis, I needed to summarize by subsets of fields. For example, the ‘party’ fields had both Republicans and Democrats, and I wanted to summarize each of them separately by the donation amount. With the help of a user on GIS Stack Exchange, I modified a python script to complete that complicated join that outputs a table with all the sums.

I then joined the output table for each cycle to the historic county shapefile it corresponds with so that the joining would be as accurate as possible. The data could then be analyzed just like any other type of county-level data.

It was incredibly daunting to work with such a large dataset at first, but by using PG Admin to clean the data, geocoding with StreetMap Premium and stamping, aggregating and joining the data, I was able to complete my analysis for Sunlight Foundation with ease.