Repeatable Data Processing Workflows with Docker and Make

The Civic Apps team recently spent some time standardizing and improving our workflows, project setups, and team best practices. We’ve mostly focused on improving our software development work. However, like most teams at Azavea, we often have to process data, either for analysis or for use within web applications, and our data processing workflows lacked the rigor of our software development workflows.

In the past, we would process project data with one-off scripts that weren’t checked into the project repositories. Even when the scripts were checked in, most developers didn’t have all of the necessary dependencies installed on their machines. Sometimes data was checked into a repository with no record of how it was created. All of these situations made it difficult to update the processed data when we received new raw data.

As we worked to improve our processes around making repeatable, deterministic workflows for building and deploying our projects, we noticed that there was room for improvement in our data processing workflows. The data processing pipeline should be as repeatable and deterministic as the software build process.

Building a data processing pipeline

Software development teams at Azavea build their projects using Docker. Our development process sped up considerably when we switched to Docker from a Vagrant-based setup. We are also big fans of Make, an old build tool for running scripts and creating files. For a good guide to getting started with Make, read Mike Bostock’s Why Use Make. We leveraged these two technologies to create a data processing container, which contains all of the dependencies needed by the data processing scripts, as well as the raw data.
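
To give a sense of what that container can look like, here is a hypothetical Dockerfile sketch for the data-processing service. The base image and the GDAL package are illustrative assumptions standing in for whatever dependencies your scripts need; the /usr/src/data directory matches the working directory used in the script below.

# Hypothetical data-processing image: bundles the raw data together with
# the tools the processing scripts depend on (Make and GDAL in this sketch)
FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends make gdal-bin && \
    rm -rf /var/lib/apt/lists/*

# Raw data, processing scripts, and the Makefile all live in one directory
COPY . /usr/src/data
WORKDIR /usr/src/data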

First off, when the container is built, Make runs and turns the raw data into something usable within the app. That generated data is then copied into the application container. The next time the containers are updated, Make runs again and regenerates data only if any of the raw inputs have changed.
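
Make handles that change detection because each output target lists its raw inputs as prerequisites, so a target is rebuilt only when its inputs are newer than its output. A minimal sketch of such a Makefile is below; the file names and the ogr2ogr conversion are placeholders for whatever your raw data and processing steps actually are (note that recipe lines must be indented with tabs).

# Hypothetical Makefile: each generated file names the raw file it is
# built from, so `make all` only redoes work when the raw data changes
all: output/neighborhoods.geojson

output/neighborhoods.geojson: raw/neighborhoods.shp
	mkdir -p output
	ogr2ogr -f GeoJSON $@ $<

clean:
	rm -rf output

.PHONY: all clean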

Take a look at the sample script below. We run this script anytime we want to update the project containers or regenerate the project data.

#!/bin/bash
set -e

# Install NPM modules
docker-compose \
  -f docker-compose.yml \
  run --rm --no-deps app \
  npm install

# Build containers, which will install dependencies
docker-compose \
  -f docker-compose.yml \
  build

# Generate data in the data-processing container using Make
docker-compose \
  -f docker-compose.yml \
  run --rm --no-deps \
  --workdir /usr/src/data data-processing \
  make all

# Copy generated data to the app container
./scripts/copydata.sh
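
The contents of copydata.sh aren’t shown here, but its job is simply to get the generated files out of the data-processing container and into the app’s source tree. One possible approach, assuming the output and destination paths from the sketches above, is to stage a short-lived container and use docker cp:

#!/bin/bash
set -e

# Hypothetical copy step: start a throwaway data-processing container,
# copy the generated output into the app's source tree, then clean up.
# `docker-compose run -d` prints the container name, which docker cp can use.
CONTAINER=$(docker-compose -f docker-compose.yml run -d --no-deps data-processing true)
docker cp "${CONTAINER}:/usr/src/data/output" ./src/app/data
docker rm "${CONTAINER}"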

We hope to continue to implement this workflow in other projects and improve upon it as we learn more. Have you used Make with Docker on a recent project? Or have you implemented a different repeatable data processing workflow? Get in touch!