Modeling Count-based Raster Data with R and ArcGIS

We’ve recently been working with the team at the Rutgers Center on Public Security to build a desktop utility that helps crime analysts develop robust, statistically valid models of crime risk at locations within their jurisdiction. These models help predict the level of crime at different locations so that a police department can better allocate resources. Although we are working with crime data, the process of modeling the number of events that occur in a given geographic area over a given time period has broad application. Whether you are modeling the density of trees across a landscape, the number of cell phone calls in different neighborhoods, or the number of crimes on a street, the goal is to explain the rate of events as a function of the characteristics of the location.

GIS analysts are often familiar with the regression models available within ArcGIS for Desktop, such as ordinary least squares (OLS) and geographically weighted regression (GWR). If you are not familiar with them, I’d encourage you to take a look at Esri’s online training seminars. It’s good stuff.

While these techniques are often useful, they assume that the response variable, given the explanatory variables, is normally distributed. If the counts within your units of analysis are large enough, the normal distribution can serve as a reasonable approximation of the underlying distribution. That said, counts are discrete and can never be negative, so count data are fundamentally not normally distributed, and there are better options that should be used.

The problem with regression models that assume a normal distribution quickly becomes apparent as the unit of analysis shrinks. If you are modeling data at a fine geographic resolution, such as in a raster, you will often have many cells with no events at all and a low average count across the cells. In these situations an OLS or GWR model within ArcGIS is simply the wrong tool: it can predict negative counts, and the normal approximation no longer holds.
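To make the point concrete, here is a minimal sketch in Python (standard library only). It assumes, purely for illustration, that cell counts follow a Poisson distribution, and compares the share of empty cells with the probability that a normal distribution of the same mean and variance would place on impossible negative counts. The three mean values are hypothetical, not taken from the crime data.

```python
import math
from statistics import NormalDist

def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count per cell is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def normal_mass_below_zero(mean):
    """Probability that a normal with the Poisson's mean and variance falls below zero."""
    # A Poisson distribution has variance equal to its mean.
    return NormalDist(mu=mean, sigma=math.sqrt(mean)).cdf(0.0)

# Hypothetical average counts per cell: fine raster, coarser raster, large districts.
for mean in (0.2, 2.0, 50.0):
    print(f"mean={mean:5.1f}  "
          f"P(count=0)={poisson_pmf(0, mean):.3f}  "
          f"normal mass below 0={normal_mass_below_zero(mean):.3f}")
```

At a low average count most cells are empty and the matching normal curve puts a large share of its probability on negative values; at a high average count both numbers become negligible, which is why the approximation is tolerable for large units of analysis.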

Instead, ArcGIS can be used to aggregate your data into a set of counts and explanatory variable values for each raster cell. This data set can then be exported and analyzed in statistical software packages that provide more appropriate models. For example, count data can often be modeled with a generalized linear model, such as Poisson regression. The free and open source R project provides many packages that can build such a model. To learn more about how this process can work, take a look at the presentation.
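As a rough illustration of what such a model does, the sketch below fits a Poisson regression with a log link, one common generalized linear model for counts, using Newton-Raphson. In R the same fit is a single call to glm() with family = poisson; the version here is written in plain Python only to spell out the steps. The function name, the variable names, and the ten data points are all hypothetical stand-ins for a table exported from ArcGIS, and this is not the method used in our utility.

```python
import math

def fit_poisson_regression(x, y, max_iter=50, tol=1e-10):
    """Fit log(expected count) = b0 + b1 * x by Newton-Raphson.

    Returns (b0, b1). Assumes at least one nonzero count and
    at least two distinct x values.
    """
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]  # expected count per cell
        # Score vector: gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information, a 2x2 matrix weighted by the expected counts.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system directly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            return b0, b1
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical exported table: one row per raster cell.
risk_score = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]   # made-up explanatory variable
event_count = [0, 1, 0, 1, 1, 2, 2, 4, 5, 7]  # made-up counts per cell

b0, b1 = fit_poisson_regression(risk_score, event_count)
print(f"intercept={b0:.3f}  slope={b1:.3f}  rate ratio per unit={math.exp(b1):.3f}")
```

Because the model works on the log of the expected count, its predictions can never be negative, and the exponentiated slope reads as the multiplicative change in the event rate for each one-unit increase in the explanatory variable.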