What the Heck Is … Spark?

Ten Percent Time

After their first six months, most Azaveans can propose a learning or research project and spend up to 10 percent of their time pursuing it. These research projects are an opportunity for employees to explore new technologies, learn new skills, or work on a project that could benefit the community. Some of these projects work out and some don’t, but collectively they help us both advance the state of the art and stay on top of new technologies.

In December, five of us began a group research project to evaluate and port existing machine learning and modeling functionality in HunchLab from R to Spark. As research projects at Azavea go, this one is unusual: its members come from three separate software teams. This type of cross-team collaboration gives the developers involved an opportunity to work with and learn from one another, and to advance the project more quickly than any of us could on our own.

Why Spark?

Spark is a general-purpose cluster computing framework that can aggregate and use the capabilities of hundreds or even thousands of machines in concert. Modeling and analyses that were once unthinkable on a single machine can be done quickly using Spark. Azavea’s GeoTrellis project was our first work with Spark, and we achieved some significant performance milestones processing large amounts of raster data. Before starting this research project, however, it was unclear whether Spark would be appropriate for a project like HunchLab.

HunchLab is a predictive policing platform from Azavea that, among other things, uses advanced statistical modeling and machine learning to forecast where crime is most likely to occur in order to help police departments allocate resources effectively. The data requirements for this type of analysis are substantial: for a city the size of Philadelphia, a typical analysis could include millions of observations. Over the course of this research project it has become clear that Spark is more than capable of handling this data. Additionally, by combining Spark with the MLlib machine learning library, we were able to implement many of the forecasting models without having to write a lot of new code.


The results from this research project so far have been encouraging. We are able to process more data, faster, and at a greater resolution than we could with the standalone Scala and R code that is currently being used. Once this research project has ended, the techniques learned and skills gained will help us solve problems across a number of domains beyond HunchLab. As real-time sensor data becomes more available and the need for geospatial applications to analyze it grows, Spark will be a great tool to have when working with this type of data.