When Mapping Quantities, Choices Matter

An article came up last Friday on Technically Philly that mentioned one of the winning projects from Azavea’s Open Data Philly Visualization Contest, a bike theft study by Greg Kaminsky. Greg chose to look at bicycle theft in Philadelphia, a topic similar to a 2013 Summer of Maps project I completed for the Bicycle Coalition of Greater Philadelphia. Greg’s takeaway was that the highest number of bicycles stolen from a single location was 15, and that this occurred right outside City Hall. While that may be true given the geocoding and visualization techniques used, the data seem to hold more complexity in how thefts cluster. Perhaps there are more significant clusters elsewhere that simply do not fall on one exact point. Based on my previous work in the area, I knew several areas other than City Hall also had high rates of theft, so I decided to explore some other methods for analyzing clusters in the data.

Greg wasn’t incorrect, but his results demonstrate how the choices cartographers make during map creation can greatly affect the results. The location with the most bike thefts in Philadelphia varies depending on a number of factors, such as the geographic scale or the normalization used. How we define a geographic location is very important. For example, we could use small buffers around theft incidents, or look at street corners, blocks, city council or police districts, census tracts, and so on.

I wanted to explore how the choice of geographic level used to aggregate thefts affects where apparent clusters of theft exist. Let’s take a look at the full set of 10,747 reported thefts over six years:

It’s not easy to see exactly where the most thefts are. The huge number of points makes the map far too busy. Now let’s see which locations have had the most bikes stolen from them based on the type of geographic clustering or aggregation we use.

Clustering by geographic boundary

1. Clustering based on City Blocks

For this method, I identified the city block each theft fell within or was nearest to, and summed the thefts per block. What we have is a map of thefts by block across Philadelphia over a six-year period, from January 1, 2007 to December 31, 2012. It appears that high-theft areas have been narrowed down considerably. The thefts per block range from two to twelve, with darker-colored blocks having more thefts. Areas of high theft are immediately apparent near Temple University, University City, Center City, and in far south Philadelphia. Keep in mind that some blocks are larger than others, so this should not be read as a measure of density.
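The block aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the workflow actually used for these maps: the block rectangles, the theft coordinates, and the helper names (`distance_to_block`, `assign_block`, `thefts_per_block`) are all hypothetical, and real blocks would be irregular polygons handled by a GIS package.

```python
from collections import Counter

# Hypothetical blocks as axis-aligned rectangles in feet:
# (min_x, min_y, max_x, max_y). Real city blocks are irregular polygons.
BLOCKS = {
    "block_a": (0, 0, 400, 400),
    "block_b": (450, 0, 850, 400),
}

def distance_to_block(x, y, rect):
    """Distance from a point to a rectangle; 0 if the point is inside it."""
    min_x, min_y, max_x, max_y = rect
    dx = max(min_x - x, 0, x - max_x)
    dy = max(min_y - y, 0, y - max_y)
    return (dx * dx + dy * dy) ** 0.5

def assign_block(x, y, blocks):
    """Return the block that contains the point, or else the nearest one."""
    return min(blocks, key=lambda name: distance_to_block(x, y, blocks[name]))

def thefts_per_block(points, blocks):
    """Sum thefts per block."""
    return Counter(assign_block(x, y, blocks) for x, y in points)

# Made-up theft locations; two of them fall in the street between the blocks.
THEFTS = [(100, 100), (390, 200), (420, 50), (440, 50), (600, 300)]
counts = thefts_per_block(THEFTS, BLOCKS)
```

Thefts that fall in the street between two blocks are snapped to whichever block is closer, which is one reasonable reading of "fell within or was nearest".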

The only blocks with more than ten thefts over the time period (January 2007 – December 2012) are located in University City:

2. Clustering based on Census Tracts

Census tracts are areas delineated by the Census Bureau to optimally contain about 3,000 – 8,000 people each, across most of the United States. Tracts are about 1/3 the size of a typical Philadelphia neighborhood, and are extremely useful for visualizing and interpreting thousands of Census Bureau variables for demographic analysis. Here we see the highest-theft areas by census tract.

The census tract that makes up most of University City is yet again the highest-theft area. This is the only tract with more than 300 reported thefts over the 2007 to 2012 period.

3. Clustering based on Neighborhoods

Neighborhoods are fun to use for this kind of analysis because they’re recognizable places with names and characters we can identify easily. This method repeats the aggregation from before, just at a larger scale. The highest-theft neighborhoods are Washington Square West (530 thefts), Rittenhouse (790 thefts), and University City (816 thefts).

It’s interesting to see how the top theft location changes based on what kind of geography we choose to summarize the data by. Here’s another wrinkle: some of these geographies have much larger perimeters or areas, sometimes because they include parks, cemeteries, or have to cover a larger geographic area to include a sufficient number of people to fit the Census requirements.
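One way to account for the size differences mentioned above is to normalize counts by area. The sketch below uses entirely hypothetical zones and figures (they are not taken from the Philadelphia data) to show how a ranking by raw count can flip once thefts are expressed per square mile.

```python
# Hypothetical zones: (theft count, area in square miles). Not real figures.
ZONES = {
    "zone_small": (120, 0.25),
    "zone_large": (300, 1.5),
}

def thefts_per_square_mile(zones):
    """Normalize each zone's theft count by its area."""
    return {name: count / area for name, (count, area) in zones.items()}

density = thefts_per_square_mile(ZONES)

# Rank zones by raw count and by density to compare the two orderings.
ranked_by_count = sorted(ZONES, key=lambda z: ZONES[z][0], reverse=True)
ranked_by_density = sorted(density, key=density.get, reverse=True)
```

Here the larger zone has more thefts in total, but the smaller zone has the higher density.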

Clustering by Proximity

Maybe aggregation based on blocks, tracts, or neighborhoods isn’t the best way to measure “a location”. Perhaps a certain area is hit more often by bike thieves because of conditions present only in the immediate area, such as a vacant building, poor lighting, or proximity to an easy exit. One common operation in GIS is buffering, where a circle or other shape is drawn around a point at a given distance. If we draw buffers of a given distance around every single theft, and then count how many thefts fall inside each circle, we can find the circles or areas that have the most thefts. Let’s try this out at a few different distances.
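As a rough sketch of the buffer-and-count idea, the Python below counts, for each theft, how many thefts lie within a given radius. The coordinates are made up, the function name `count_within_radius` is hypothetical, and counting each theft inside its own buffer is an assumption. The coordinates are assumed to be in feet in a projected coordinate system.

```python
import math

def count_within_radius(points, radius):
    """For each point, count the points (itself included) within `radius`."""
    counts = []
    for x1, y1 in points:
        inside = sum(
            1 for x2, y2 in points
            if math.hypot(x2 - x1, y2 - y1) <= radius
        )
        counts.append(inside)
    return counts

# Made-up theft locations along a single street, in feet.
THEFTS = [(0, 0), (50, 0), (90, 0), (250, 0), (600, 0)]

# Counts per buffer at the three radii used in this post.
results = {radius: count_within_radius(THEFTS, radius) for radius in (100, 300, 500)}
```

Even in this toy data the busiest buffer moves as the radius grows: at 100 feet the first three thefts tie for the highest count, while at 500 feet the buffer around the fourth theft contains the most. This brute-force comparison checks every pair of points, so for roughly ten thousand thefts a spatial index or a GIS tool would be the more practical choice.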

100 Foot Buffers

With 100-foot buffers, only a few areas in the city have buffers that contain more than eight thefts over the six-year period.

300 Foot Buffers

Now let’s bump up the buffer radius around each theft to 300 feet and then see how many fall inside each buffer. Interestingly, it seems that the high theft areas have shifted. The only buffers with more than 15 thefts each are located at Walnut and Broad, 9th and South St, and 13th and Walnut:

500 Foot Buffers

When we enlarge the buffers to 500 feet so they cover about one square block, the high-theft areas shift yet again, with the highest-count buffers containing 25 to 32 thefts each. Most of the thefts are now centered around the blocks of Walnut Street on either side of Broad Street, with another high area still at 9th and South Street:

Advanced Clustering

What if the location with the highest number of thefts isn’t a discrete, arbitrary circle somewhere in the city, but rather a contiguous area with similar characteristics? Cluster analysis looks for statistically significant, contiguous clusters of areas with similar values. When we run cluster analysis on the theft-by-block file, using inverse Manhattan distance, we get this:

This map shows three kinds of clusters: blocks with high numbers of thefts surrounded by low-theft blocks (HL), blocks with high numbers of thefts surrounded by other high-theft blocks (HH), and blocks with low numbers of thefts surrounded by high-theft blocks (LH).
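To make the high-high (HH), high-low (HL), and low-high (LH) labels more concrete, here is a simplified Python sketch. It compares each block's theft count with the mean, then compares a weighted average of all other blocks with the mean, using inverse Manhattan distance between block centroids as the weight. This resembles the quadrant labeling used in local Moran's I analysis, but whether that is the exact method behind the map is an assumption. The sketch also omits the statistical significance test that a full analysis would apply, so it labels every block, including low-low (LL) ones. The centroids, counts, and function names are hypothetical, and centroids are assumed to be distinct.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def quadrant_labels(centroids, values):
    """Label each block HH, HL, LH, or LL relative to the overall mean.

    First letter: the block's own value is above (H) or not above (L) the mean.
    Second letter: the same for the inverse-distance-weighted average of all
    other blocks. No significance testing is performed in this sketch.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    labels = []
    for i in range(n):
        weights = [
            0.0 if i == j else 1.0 / manhattan(centroids[i], centroids[j])
            for j in range(n)
        ]
        # Row-standardize so the weights for block i sum to 1.
        lag = sum(w * d for w, d in zip(weights, deviations)) / sum(weights)
        own = "H" if deviations[i] > 0 else "L"
        neighbors = "H" if lag > 0 else "L"
        labels.append(own + neighbors)
    return labels

# Hypothetical block centroids and theft counts: one mostly high-theft
# group near the origin and one mostly low-theft group far away.
CENTROIDS = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
THEFT_COUNTS = [10, 9, 8, 1, 1, 1, 1, 9]

labels = quadrant_labels(CENTROIDS, THEFT_COUNTS)
```

In this toy data the quiet block at (1, 1) comes out as LH because its nearest neighbors are all high, and the busy block at (11, 11) comes out as HL because its nearest neighbors are all low.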

Welp. Better lock up your bikes, Philadelphians.

Additional Considerations

Other things to consider:

Time: All of these operations used all six years’ worth of data. What if I’d just used 2010? Or 2012? The clusters, hotspots, and prime locations would probably be entirely different!

Projection: All of this map data was calculated using a projected coordinate system, North American Datum 1983 State Plane Pennsylvania South. In plain English, we chose a measurement system that reduces distortion and preserves certain attributes (distance, area) at the local level. The default Tableau map projection is Web Mercator, a projection system used to make 256×256 pixel tiles, and its distortion grows the further from the equator measurements are taken. This might account for Greg’s cluster of 15 points at a small location that I couldn’t replicate.

Geocoding: Another thing to consider is geocoding inaccuracy. If many addresses could not be geocoded to the exact address, they might have defaulted to the city name (Philadelphia), which could also account for multiple thefts falling at the exact same point near City Hall (perhaps a commonly used location for geocoding to “Philadelphia”).
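To put a rough number on the projection point above: in a spherical Mercator projection, distances are stretched by a factor of 1 / cos(latitude). The short Python sketch below applies that formula at an approximate latitude for Philadelphia. The latitude value and the function name are assumptions for illustration, and this is not a reconstruction of how Tableau or any other tool measures distance.

```python
import math

def web_mercator_scale_factor(latitude_deg):
    """Linear stretch of spherical Mercator at a latitude: 1 / cos(latitude)."""
    return 1.0 / math.cos(math.radians(latitude_deg))

# Approximate latitude of central Philadelphia (an assumption for illustration).
PHILADELPHIA_LAT = 39.95

scale = web_mercator_scale_factor(PHILADELPHIA_LAT)

# A "100 foot" distance measured in raw Web Mercator units covers
# less than 100 feet on the ground at this latitude.
ground_feet = 100 / scale
```

At this latitude the stretch is about 1.3, so a 100-unit buffer measured directly in Web Mercator coordinates covers only about 77 feet on the ground. A local projection such as State Plane avoids most of that error.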

Cartography is both an art and a science. The decisions the cartographer makes hugely inform the results. With the popularity of easy web-mapping tools like CartoDB, and other tools that provide light mapping capabilities like Tableau, it’s more important than ever to be a discerning producer and consumer of cartography.