In analyzing crime, many police departments visualize recent concentrations of crime and then focus resources on these hotspots in an effort to suppress crime. Such a process is based upon the assumption that crimes will occur where they occurred recently, and it does not seek to determine (statistically) to what degree this assumption is true. For a number of years, the academic community has written about predicting crime using not only crime data but also other correlated data sets that may help to explain why crimes emerge at particular locations. Even back in 2002, Elizabeth Groff and Nancy La Vigne published a paper that surveyed methods to predict crime and painted a picture of the future of crime prediction. It’s a great read. Yet only recently have many departments begun to explore crime prediction in any real manner.
One of the methodologies that Groff and La Vigne outlined in 2002 is the concept of representing criminal risk via the proximity of geographic features — the context within which crime occurs. Over the past few years, such an approach has gained traction within law enforcement thanks largely to the efforts of Joel Caplan and Les Kennedy at Rutgers University in the context of their Risk Terrain Modeling (RTM) outreach.
As part of R&D work for our HunchLab product, late last summer I began exploring how to automate the RTM approach to forecast crime. In collaboration with William Huber, I developed a way of maintaining the desired simplicity and explanatory power of the RTM approach while increasing the robustness of the statistics. Here is how the approach works.
The objective of RTM is to represent criminal risk by the proximity and density of risk factors such as bus stops, bars, and gang members. The role of statistics in the RTM process is to determine the significance and spatial reach of the risk factors in order to develop a model that balances simplicity with accuracy. Our units of analysis are cells within a raster covering the study area, so that we can quite literally model the “environmental backcloth” of crime. You can find more information about such a raster approach in this related blog post.
We are given a collection of potential risk factors — points in space that represent crime generators and attractors such as bars or bus stops. For each factor we want to determine the geographic reach of its effect, so we build a series of variables that measure whether each raster cell is within a certain distance of the risk factor or within an area of high concentration of that risk factor. For example, we calculate whether each cell is within 500, 750, and 1000 feet of a bar. We assemble these values into a table where rows represent cells and columns represent binary variables. If presented with 10 risk factors, we might expand that to 80 variables that are variations of distances and densities. We also calculate the number of crimes that occur within each cell to use as our outcome variable.
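To make this concrete, here is a minimal sketch of how such a table might be assembled in R. The cells, bars, and crimes objects are hypothetical two-column matrices of projected coordinates (in feet), and cell_id_of() is a hypothetical helper that maps each crime to the raster cell containing it; in practice a GIS or a raster-oriented R package would handle that step.

    # Hypothetical inputs: `cells`, `bars`, and `crimes` are two-column matrices
    # of projected x/y coordinates measured in feet.

    # Distance from each raster cell centroid to its nearest bar
    nearest_dist <- function(from, to) {
      apply(from, 1, function(p) min(sqrt((to[, 1] - p[1])^2 + (to[, 2] - p[2])^2)))
    }
    d_bar <- nearest_dist(cells, bars)

    # Binary variables capturing different spatial reaches of the bar risk factor
    model_table <- data.frame(
      bar_500  = as.integer(d_bar <= 500),
      bar_750  = as.integer(d_bar <= 750),
      bar_1000 = as.integer(d_bar <= 1000)
    )
    # ... repeat for the other risk factors and their density-based variants ...

    # Outcome variable: the number of crimes falling within each cell
    # (cell_id_of() is a hypothetical helper returning each crime's cell index)
    model_table$crime_count <- tabulate(cell_id_of(crimes), nbins = nrow(cells))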
Generating such a large number of variables opens us up to problems with multiple comparisons, in that we may uncover spurious correlations simply due to the number of variables we are testing. To address this issue, we use cross-validation to build a penalized Poisson regression model. You can accomplish this process in R using the penalized package.
Penalized regression balances model fit with complexity by pushing variable coefficients towards zero. We select the optimal amount of coefficient penalization via cross-validation and bypass the use of statistical significance tests to select our variables. This process might reduce our set of 80 variables to a smaller set of 20 variables with non-zero coefficients. It is important to note that using the model resulting from this step, the penalized model, would be perfectly valid. All 20 variables are playing a useful (“significant”) part within the model.
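As a rough illustration (not the exact code we run), the cross-validated penalized fit might look like the following, assuming the model_table built above. optL1() in the penalized package selects the lasso penalty by cross-validation; check the package documentation for the precise arguments.

    library(penalized)

    # Predictor matrix of binary risk variables and the crime-count outcome
    X <- as.matrix(model_table[, setdiff(names(model_table), "crime_count")])
    y <- model_table$crime_count

    # Choose the L1 (lasso) penalty by 10-fold cross-validation
    cv <- optL1(response = y, penalized = X, model = "poisson", fold = 10)

    # Refit at the selected penalty; variables with non-zero coefficients survive
    fit <- penalized(response = y, penalized = X, lambda1 = cv$lambda, model = "poisson")

    # coefficients() on the fitted object returns the non-zero coefficients by default;
    # their names give us the reduced variable set for the next step
    selected <- names(coefficients(fit))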
Since our goal is to build an easy-to-understand representation of crime risk, we decide to further simplify the model.
A Simpler Model
To build a more parsimonious model, we use a bidirectional stepwise regression process. We begin with a model (the current candidate) containing no variables and measure its Bayesian information criterion (BIC). The BIC score balances model complexity against fit. We try individually adding each of our 20 variables to the null model and measure the resulting BIC scores. The model with the best (lowest) BIC score is selected as our new candidate model. We then repeat the process, attempting to add and remove variables one step at a time to improve our BIC score. When no step improves our score, we stop. In our example, this process may take our 20 variables and result in an optimal model containing 4 risk factors.
You can accomplish this process in R with the built-in glm() and step() functions, among other options. You can also enforce rules within the stepwise regression by writing a function to customize the behavior. For instance, we can require that at most one variable representing the influence of bars is allowed to enter the model.
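Here is a hedged sketch of that stepwise step using glm() and step(), assuming selected holds the names of the roughly 20 risk variables kept by the penalized regression and model_table is the table built earlier. step() minimizes AIC by default; passing k = log(n) makes it minimize BIC instead. Custom rules (such as allowing at most one bar variable) require a hand-written stepping loop and are not shown here.

    # Data frame limited to the variables kept by the penalized regression
    dat <- model_table[, c(selected, "crime_count")]

    # Null model (no risk factors) and the scope of variables step() may add
    null_model <- glm(crime_count ~ 1, family = poisson, data = dat)
    full_scope <- as.formula(paste("~", paste(selected, collapse = " + ")))

    # Bidirectional stepwise search; k = log(n) makes step() use BIC rather than AIC
    final_model <- step(null_model,
                        scope = list(lower = ~ 1, upper = full_scope),
                        direction = "both",
                        k = log(nrow(dat)))

    summary(final_model)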
Outputs & Criticisms
Our process results in an easy-to-understand model that selects not only the significant risk factors but also their optimal spatial influence. Generating a map of crime risk is simply a matter of applying this model to our variables (or updated values of our variables if we so desire) to generate predicted counts within each raster cell.
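In R this scoring step is essentially one line, assuming the final_model from the stepwise sketch above and a hypothetical data frame cells_now holding the current values of the selected variables for every raster cell.

    # Predicted crime counts (modeled risk) for each raster cell
    risk <- predict(final_model, newdata = cells_now, type = "response")
    # Join `risk` back to the raster cells to map the modeled risk surface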
The statistically inclined may immediately bring up a few points about this approach that I hope to address here.
Even if you are modeling several months’ worth of crime data, many of your cells will contain zero events, which may worry you. For instance, in aggregating a crime data set of 467 events into 36,752 raster cells, I found that 98.8% of the cells contained zero crimes. At first glance, this seems like a problem. How many cells should be zero, however? Let’s consider crimes a Poisson process. If we evenly distributed crimes among our cells, we would have an average of 0.0127 crimes per cell. We can simulate draws from a Poisson distribution with this mean value to determine how often a cell would contain zero crimes: 98.7% of the time.
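That check is easy to reproduce in R:

    # Expected share of empty cells if crimes were spread evenly as a Poisson process
    lambda <- 467 / 36752              # about 0.0127 crimes per cell
    mean(rpois(1e6, lambda) == 0)      # simulated share of zero-count cells, ~0.987
    exp(-lambda)                       # exact Poisson probability of a zero, ~0.987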
This modeling process does not attempt to incorporate the spatial relationships between cells in the regression itself. This criticism often focuses on the fact that regression models assume independent observations and that significance values will be inaccurate when spatial autocorrelation is present among the observations. Our process deemphasizes the use of significance tests for variable selection in favor of cross-validation. Further, most spatial regression packages do not support Poisson data with low values. We had to choose the battles we wanted to wage; no model is perfect.
For the scripting savvy, it is nice that this entire process can be accomplished (with some effort) using the free R project and related packages.
What are you to do if you are not an R ninja? We’ve nearly completed the development of an automated utility that uses this methodology for Rutgers University. This utility will be released in the coming months, so stay tuned to the RTM listserv for more information.