A few months ago we set out to create a new system for forecasting spacetime events that would be scalable, adaptable, and accurate. Such a system has many potential uses, such as modeling global conflict or forecasting crime. To accomplish our objectives we needed tools with deep capabilities. Processing geospatial data at significant scale in a timely manner, for instance, is a great fit for our GeoTrellis project, while ready access to sophisticated modeling and machine learning is a natural fit for projects such as R and Apache Mahout. How, though, can such diverse tools function together? CSVs.
While machine learning and “big data” may be all the rage in the analysis community, at the end of the day these systems mostly boil down to a string of components, each transforming tabular data into new tabular data. There are more efficient data formats for specific tools, but by standardizing on CSVs wherever possible we designed a system that is modular and allows individual components to be swapped out at will. A loosely coupled system also allows greater flexibility for distributed and resilient processing: each step in the workflow can be monitored for success, and work can be spread across multiple servers.
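The CSV contract between stages can be sketched in a few lines of Python. The helper names here (`run_stage`, `transform`) are hypothetical illustrations of the pattern, not part of our actual codebase:

```python
import csv
import io

def transform(rows, fn):
    """Apply fn to each row dict, yielding new row dicts."""
    for row in rows:
        yield fn(row)

def run_stage(in_text, fn, fieldnames):
    """One pipeline stage: read CSV text, transform rows, emit CSV text."""
    reader = csv.DictReader(io.StringIO(in_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in transform(reader, fn):
        writer.writerow(row)
    return out.getvalue()
```

Because every stage speaks the same tabular dialect, a Scala/GeoTrellis stage, an R stage, or a Python stage can be substituted for any other without touching the rest of the workflow.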
Our approach to predicting spacetime events requires a few steps:
- Process outcomes (events) and covariates into tabular format
- Build a model
- Make predictions using the model
In our case, the outcomes we want to predict are events in space and time. Our covariates might be prior knowledge of event frequencies at a particular location, the distance or density of geographic features, or temporal variables such as the day of week and weather. We decided to use Scala and GeoTrellis to fulfill these needs. Our inputs to this step are CSVs (events, temporal variables), Shapefiles (points, lines, polygons), and GeoTIFFs (rasters). Our outputs are simply CSVs.
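As an illustration of what this step produces, here is a hypothetical sketch (the column names and helper are illustrative, not our actual schema) of assembling outcomes and a simple temporal covariate into one table:

```python
import datetime as dt
from collections import Counter

def build_rows(events):
    """events: iterable of (cell_id, iso_date) pairs, one per observed event.
    Returns one row per (cell, day) pairing the outcome (event count)
    with a day-of-week covariate."""
    counts = Counter(events)
    rows = []
    for (cell, date), n in sorted(counts.items()):
        d = dt.date.fromisoformat(date)
        rows.append({
            "cell_id": cell,
            "date": date,
            "day_of_week": d.weekday(),  # 0 = Monday ... 6 = Sunday
            "event_count": n,
        })
    return rows
```

In the real system the covariate columns would also include spatial terms (prior kernel densities, distances to geographic features) computed by GeoTrellis, but the shape of the output is the same: one flat table, written as a CSV.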
In the process of working on this project, we extended GeoTrellis to support some geographic operations over rolling temporal windows. For instance, we can rapidly calculate counts and kernel densities over the previous 28 days for every time period within a range. We also implemented a simple raster-to-CSV conversion operation to support the use of diverse modeling packages alongside GeoTrellis.
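The sliding-window idea behind those rolling counts can be sketched in Python (our actual implementation runs in Scala over GeoTrellis rasters; this single-cell version just shows the technique):

```python
from collections import deque

def rolling_counts(daily_counts, window=28):
    """daily_counts: per-day counts for one cell, in date order.
    Returns, for each day, the total over the trailing `window` days
    (inclusive of the current day), in a single O(n) pass."""
    q = deque()
    total = 0
    out = []
    for c in daily_counts:
        q.append(c)
        total += c
        if len(q) > window:
            total -= q.popleft()  # drop the day that fell out of the window
        out.append(total)
    return out
```

Maintaining a running total instead of re-summing each window is what makes it cheap to compute the statistic for every time period in a range.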
For machine learning, we are leveraging the R project and Apache Mahout. While our developers were initially skeptical about using R in a production system, we’ve been pleasantly surprised with its scale and speed. Many R packages contain C++ or Fortran code under the hood that is quite fast. R also allowed us to easily generate graphical output to examine our results.
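Because every stage reads and writes CSVs, driving R from the rest of the pipeline reduces to invoking `Rscript` with file paths. The wrapper below is a hypothetical sketch of that glue; the script name and argument convention are assumptions, not our actual interface:

```python
import subprocess

def run_r_stage(script, in_csv, out_csv, rscript="Rscript"):
    """Run an R script that reads `in_csv` and writes `out_csv`.
    Raises CalledProcessError if the R process exits non-zero,
    so a failed modeling step halts the workflow visibly."""
    subprocess.run([rscript, script, in_csv, out_csv], check=True)
```

Monitoring each stage for a non-zero exit code is what makes the loose coupling resilient: a failure is caught at the step boundary rather than silently corrupting downstream tables.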
We also spent some time working with Apache Mahout. While Mahout scales much further than R, we found that R was better for rapid prototyping and handled most of the scale we needed for this project (with some tricks). If we need greater scale in the future, we can incorporate Mahout more directly without changing our overall workflow.
Measuring accuracy in meaningful ways is an essential part of predictive modeling. Ideally, it should be easy to iterate over new ideas and measure how your results progress. Since we built many individual components for the project, we needed a tool to glue everything together and to allow a user to easily accomplish particular tasks. We decided to use Python to build a command line tool for this part of the project.
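A minimal sketch of what such a command-line glue tool could look like using `argparse`; the program and subcommand names (`forecast`, `prepare`, `train`, `predict`) are illustrative, not our actual interface:

```python
import argparse

def make_parser():
    """One subcommand per pipeline step; each step takes CSV paths in and out."""
    p = argparse.ArgumentParser(prog="forecast")
    sub = p.add_subparsers(dest="command", required=True)
    for name, help_text in [
        ("prepare", "run Scala/GeoTrellis data preparation, emit CSVs"),
        ("train", "fit a model in R (or Mahout) from prepared CSVs"),
        ("predict", "score new time periods with a trained model"),
    ]:
        sp = sub.add_parser(name, help=help_text)
        sp.add_argument("--input", required=True)
        sp.add_argument("--output", required=True)
    return p
```

Wrapping each step as a subcommand makes it easy to rerun a single stage while iterating on a new idea, then re-measure accuracy without re-executing the whole pipeline.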
To summarize, here is our toolchain:
- Python – primary interface to the project; glue that combines different tasks into simple commands
- Scala / GeoTrellis / R – geoprocessing and data preparation
- R / Mahout – modeling
I highly recommend such a loosely coupled approach to others working on geospatial modeling. The flexibility of this approach has worked well for us and allowed us to use best-of-breed tools across all of our requirements.