Summer of Maps Fellow Nat Henry wrote a Python script to collect the entire universe of available Philadelphia License and Inspections Department data and hosted it on Github as part of a project for the Legal Clinic for the Disabled. There were more than 1 million inspection records in the database from January 2003 to August 1, 2014. The combined CSV text file is about 330mb in size, and balloons to more than 5 gigabytes when converted into a shapefile. A CSV that large will frequently crash Excel, and a million record file starts to slow desktop GIS noticeably (and also quickly fill up a hard drive with intermediate files). I wanted to try and use the spatial Postgres plugin, PostGIS to more efficiently clean up and query the data.
You’ll first need to download the data from the Summer of Maps Github page specifically for Licenses and Inspections. After you unzip the data locally on your computer, you’ll see that it has been split into a 12-part file, and a 3-part file. I decided to use the 3-part file, and used Excel to copy-paste the data into one spreadsheet. You’ll now have one unwieldy 330mb CSV. I already had the Philadelphia Neighborhoods shapefile from Azavea loaded into my database. Now let’s get the inspections data into a PostGIS-enabled Postgres database.
In a previous blog post I described how to use QGIS to load spatial data into PostGIS databases. I followed that same method here, loading the CSV into QGIS and converting it to a shapefile, and then importing it into PostGIS with QGIS’ SPIT function. There are definitely other and better ways of doing this, but using QGIS is one of the most straightforward.
If all goes well, you should now be able to use QGIS’ DB Manager utility, or PGAdmin to explore your database and begin writing SQL queries. First I dropped a number of unnecessary fields to cut down on clutter and reduce table size:
ALTER TABLE incidents_wgs84 DROP "ViolationI", DROP "ViolationL", DROP "CaseNumber", DROP "LocationID", DROP "CaseType", DROP "CaseLocati", DROP "ViolationD", DROP "LastUpdate", DROP "Resoluti_1", DROP "StNum", DROP "StDir", DROP "StSuffix", DROP "field_28"
There are over a million records in the database, so I wanted to limit my queries to any inspections from 2014 and later. This turned out to be much more difficult than expected, mainly because the date is stored in a Unix format called, ‘Epoch’, where the dates are stored as milliseconds since 1970, not to mention the field had some extra characters that needed to be removed.
I removed the extra characters from the “AddedDate” field with the following two queries:
update incidents_wgs84 SET "AddedDate" = REPLACE("AddedDate", '/Date(',''); update incidents_wgs84 SET "AddedDate" = REPLACE("AddedDate", ')/','');
Now we should just be left with a giant string of numbers, still stored incorrectly. The next query will convert the string to a big integer (to store huge numbers), and then convert to a timestamp. It also uses a conditional regex statement I found on StackExchange to handle null value records so they don’t break the query.
I start out queries using a “WITH” statement. This is from a special syntax called Common Table Expressions (CTE) that allows you to write more procedural-style SQL queries. While less efficient, the queries are much easier to read and construct. You can learn more about CTEs at Craig Kerstiens’ blog.
WITH clean_table AS (SELECT *, to_timestamp("AddedDate"::BIGINT/1000) FROM incidents_wgs84 WHERE "AddedDate" ~ E'^\d+$')
This cleans up and converts the date field into an SQL variable called “clean_table” which I’ll continue to build on in this post.
This next query uses the “clean_table” and selects only those dates after December 31, 2013 to narrow down the number of records:
recent_violations AS (SELECT * FROM clean_table WHERE to_timestamp > '2014-12-31')
When I put both queries together, it should look like this:
WITH clean_table AS (SELECT *, to_timestamp("AddedDate"::BIGINT/1000) FROM incidents_wgs84 WHERE "AddedDate" ~ E'^\d+$') recent_violations AS (SELECT * FROM clean_table WHERE to_timestamp > '2014-12-31')
Great, now we can build on those two queries to start actually asking questions about the data.
One of the best parts about Postgis is that you can ask questions fairly quickly without any of the interstitial cruft that desktop GIS tends to accumulate. If you catch a mistake or want to change something in PostGIS you can just modify the code and run it again. Meanwhile, in desktop GIS you might create five different intermediate layers only to realize you made a mistake and have to repeat everything. Let’s build some Postgis queries to summarize the 2014-2015 L&I data.
This query performs a spatial join between the L&I violations and the Philadelphia neighborhoods shapefile. The
nested inside the spatial intersect (ST_Intersect) ensures that the incidents are projected into the local Philadelphia state plane coordinate system, EPSG 2272, to match the neighborhoods shapefile. The full query is below:
WITH clean_table AS (SELECT *, to_timestamp("AddedDate"::BIGINT/1000) FROM incidents_wgs84 WHERE "AddedDate" ~ E'^\d+$'), recent_violations AS (SELECT * FROM clean_table WHERE to_timestamp > '2014-12-31') SELECT hoods."MAPNAME", COUNT(recent_violations.gid) FROM recent_violations JOIN "Neighborhoods_Philadelphia" AS hoods ON ST_Intersects(ST_Transform(recent_violations.the_geom,2272),hoods.the_geom) GROUP BY hoods."MAPNAME" ORDER BY COUNT(recent_violations.gid) DESC Limit 10
And here’s the result (I copied the table and used the Tables Generator website to convert it to Markdown):
|West Oak Lane||1759|
It’s great to see the number of violations per neighborhood, but I wanted to dig a little bit deeper. For instance, what is the most frequent violation per neighborhood? While Postgres doesn’t have a built in mode (statistically most frequent) function built in, I approximated one with this helpful stackoverflow post. I continue to use the CTE query style, calling the spatial join I did in the previous step “AS spatial_join” to use the result later for the SELECT DISTINCT:
WITH clean_table AS (SELECT *, to_timestamp("AddedDate"::BIGINT/1000) FROM incidents_wgs84 WHERE "AddedDate" ~ E'^\d+$'), recent_violations AS (SELECT * FROM clean_table WHERE to_timestamp > '2014-12-31'), spatial_join AS ( SELECT hoods."MAPNAME", COUNT(*) as violcount, "Violatio_1" FROM recent_violations JOIN "Neighborhoods_Philadelphia" AS hoods ON ST_Intersects(ST_Transform(recent_violations.the_geom,2272),hoods.the_geom) GROUP BY hoods."MAPNAME", "Violatio_1" ORDER BY violcount DESC) SELECT DISTINCT ON ("MAPNAME") spatial_join."MAPNAME", spatial_join.violcount, spatial_join."Violatio_1" FROM spatial_join ORDER BY spatial_join."MAPNAME", spatial_join.violcount DESC, spatial_join."Violatio_1"
This query will return a table with every single neighborhood in Philadelphia that had recorded violations, with the count of and name of the most frequent violation. The following table is random selection from the most frequent violations per neighborhood. It appears that the most frequent violation per neighborhood across the city is usually a citation by the Community Life Improvement Program (CLIP), an anti-blight and community beautification effort by the City of Philadelphia.
|Neighborhood||Most Frequent Violation||Count|
|Dickinson Narrows||CLIP VIOLATION NOTICE||13|
|East Mount Airy||CLIP VIOLATION NOTICE||102|
|Greenwich||EXT A-VACANT LOT CLEAN/MAINTAI||51|
|Haverford North||CLIP VIOLATION NOTICE||19|
|Industrial||CLIP VIOLATION NOTICE||4|
|Nicetown||LICENSE-VAC RES BLDG||19|
|Northern Liberties||ANNUAL CERT FIRE ALARM||25|
|Packer Park||EXT A-CLEAN WEEDS/PLANTS||2|
|Parkwood Manor||CLIP VIOLATION NOTICE||88|
|Powelton||FIRE PROT SYS MAINTENANCE||4|
Luckily, the table also has a “Priority” field, where violations are classified in a few different categories. I modified my original spatial join query to add a WHERE clause that looks for the “IMMINENTLY DANGEROUS” priority:
WITH clean_table AS (SELECT *, to_timestamp("AddedDate"::BIGINT/1000) FROM incidents_wgs84 WHERE "AddedDate" ~ E'^\d+$'), recent_violations AS (SELECT * FROM clean_table WHERE to_timestamp > '2014-12-31') SELECT hoods."MAPNAME", COUNT(*) as violcount FROM recent_violations JOIN "Neighborhoods_Philadelphia" AS hoods ON ST_Intersects(ST_Transform(recent_violations.the_geom,2272),hoods.the_geom) WHERE recent_violations."Priority" SIMILAR TO 'IMMINENTLY DANGEROUS' GROUP BY hoods."MAPNAME" ORDER BY violcount DESC
The resulting table contains an entry for every single neighborhood in the city with at least one Imminently Dangerous Violation from 1/1/2014 to 8/1/2015. I used QGIS to export the query result as a layer, and then brought the resulting shapefile into CartoDB through the handy CartoDB QGIS Plugin. One major benefit of using a PostGIS database to perform these kinds of spatial queries is space savings: the file I uploaded to CartoDB was only 140kb, which is amazing considering it started out as a 5.2 gigabyte shapefile. Here is the map of Imminently Dangerous Violations per Neighborhood on CartoDB:
The License and Inspection data that Nat scraped is a great resource for civic hackers and data analysts. But at over 5 gigabytes as a shapefile, it can get unwieldy very quickly. Using PostGIS to perform some initial transformations and simple queries is a great way to get started digging deeper into the data without filling up a hard drive. Finally, Common Table Expressions are an excellent way of keeping your PostGIS queries legible. Try taking PostGIS for a spin with the License and Inspections data. Write me @geotyler on Twitter if you have any questions.]]>