Azavea and CloudFactory: Partners on Quality Training Data and Social Impact

Originally posted on the CloudFactory blog on October 4, 2019.

At Azavea, our mission is to create advanced geospatial technology and research for civic and social impact. That mission has led us to some interesting places – we’ve worked with the World Bank to try to reduce traffic accidents globally, created open source tools for applying machine learning to satellite imagery, and even testified in court based on our research into gerrymandering and how to solve it.

Aiming for civic and social impact in our work is so fundamental to our constitution as a company that we’ve written it into our charter. And as a Certified B Corporation, we participate in bi-annual audits on everything from our carbon footprint, to employee compensation, to our involvement in our local Philadelphia community.

two employees at Azavea sitting at desks — Azavea’s Philadelphia office

Suffice it to say, in pursuing these rather lofty ideals we tend to take the long view when weighing business decisions… the really long view. In fact, we aim to be around in 100 years. Whereas most of the companies we compete with in the tech world are obsessed with what’s new, we’re more focused on the things that aren’t going to change over the long term. A century from now, we think people will still value evidence-based insights, excellent customer service, and delightful product experiences. To deliver on those promises for decades to come, we can’t possibly go it alone – so we are always on the hunt for partners who share our values and are aligned with our interests.

Engines need fuel, machine learning models need labeled data

The nature of our work often involves wrangling vast amounts of geospatial data, from measurements of stormwater runoff on the ground, to GPS data propagating through the air, to satellite imagery streaming down from space. We work to reduce these massive, noisy datasets to relevant information feeds that customers care about. Recent advances in machine learning (and tooling to apply new ML methods) have dramatically expanded the number of questions we can pose using these large geospatial datasets and simultaneously reduced the cost and time associated with asking those questions.

Over the past few years, we’ve seen the limiting factor for project throughput shift from ML engineering capacity to training data collection and generation. I liken our ML engineering organization to a luxury car engine that needs highly refined fuel to run efficiently…and we have a fuel shortage: a shortage of labeled training data.

Coworkers standing in a circle talking — An Azavea engineering team daily “standup” meeting

Initially, we decided to start by creating training data in-house. While it was helpful to have first-hand experience labeling the data, we quickly realized that the most effective way to scale would be to outsource to an expert. Thus, our search for a compatible labeling partner began in earnest.

CloudFactory stands out among data labeling companies

We interviewed a handful of leading data labeling firms and studiously compared their pricing, approach, and even their cultural values. CloudFactory stood out for a number of reasons.

A photo of 8 CloudWorkers taken in July 2019. — The team of Cloud Workers assigned to Azavea in July 2019.
Photo by CloudFactory.

1. Mission alignment – CloudFactory’s goal to create one million high-quality jobs for people in the developing world struck us as both ambitious and exciting. One of Azavea’s core values is to ensure every employee genuinely believes their work “can contribute to a more peaceful, just, and prosperous world.” CloudFactory’s mission to empower people economically and connect them to the digital economy is directly aligned with that same core value. Not only is theirs a global mission, but it’s paired beautifully with direct interaction with our team. We’ve come to trust and rely on our project leads in Nepal and our client success manager in Durham, North Carolina. We talk with our project leads every single day, and we convene the account and project leads on a check-in call every other week. Often a company’s grand vision for social impact doesn’t translate to an exceptional customer experience; in CloudFactory’s case, it absolutely does.

2. Mutual investment – One of the quirks of geospatial data is that it’s quite tricky to display and file formats can be…strange. We had tried other labeling tools but satellite, drone, and aerial imagery always felt like a second-class citizen in the user experience. So we built our own in-house tool, and CloudFactory not only enthusiastically agreed to use it but has become an invaluable source of feedback as we work to improve the product constantly. In the early days, they were exceptionally patient with us as we worked out kinks in the user experience, and were communicative about bugs and best practices. (After all, they’ve used every annotation tool under the sun.) Having a partner who is willing to invest the time and energy to help us improve our tooling has been an unexpected value that we now can’t imagine doing without.

screen showing annotate tool — Screenshot showing Azavea’s annotation tool for geospatial imagery

3. Consultative approach – Surprisingly, CloudFactory was the only data labeling service we spoke with that seriously challenged our assumptions about how we should organize labeling work. Originally, we had asked for several annotators to be dedicated to labeling 40 hours per week based on a projection of the productivity we saw from our own in-house work. However, CloudFactory advised that for our task, the best structure would be to have a bigger team of part-time labelers, based on their experience combating task fatigue and producing better quality work with our use cases.

It made us question why we hadn’t gotten any pushback from other groups – after all, we aren’t the data labeling experts, they are!

4. Price – CloudFactory’s pricing was competitive – not the cheapest but not the most expensive. We knew from the outset that we weren’t trying to find the absolute cheapest option, but rather the highest-quality option we could afford to sustain. We liked the visibility we had into how pricing would scale as our commitment grew over time. Some firms we spoke with priced based on the number of annotations, which we found unusual: for some projects that required precise, complicated annotations we expected 10-15 minutes of work per image, while other simple tasks could be completed in under 30 seconds. CloudFactory priced based on hours allotted across the entire team, which made more sense – at least it was a controlled variable and therefore something we could trust would hold up to the inevitable edge cases that would come up as we worked on new and different projects over time.
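To make the pricing point concrete, here is a rough back-of-the-envelope sketch in Python. The per-item price of $0.50 is purely hypothetical (it is not a quote from any firm we spoke with), and the task durations are simply the upper end of "under 30 seconds" and the midpoint of our 10-15 minute estimate. The function shows how a flat per-item price translates into very different amounts paid per hour of labeling work, depending on how long each item takes.

```python
def implied_hourly_rate(price_per_item: float, seconds_per_item: float) -> float:
    """Amount paid per hour of labeling work under flat per-item pricing."""
    items_per_hour = 3600.0 / seconds_per_item
    return price_per_item * items_per_hour


# Hypothetical flat price per labeled item (an assumption, not a real quote).
PRICE_PER_ITEM = 0.50

# Task durations taken from the ranges mentioned above.
SIMPLE_TASK_SECONDS = 30.0          # upper bound of "under 30 seconds"
COMPLEX_TASK_SECONDS = 12.5 * 60.0  # midpoint of 10-15 minutes

simple_rate = implied_hourly_rate(PRICE_PER_ITEM, SIMPLE_TASK_SECONDS)
complex_rate = implied_hourly_rate(PRICE_PER_ITEM, COMPLEX_TASK_SECONDS)

# How many times more an hour of simple work costs than an hour of complex work.
spread = simple_rate / complex_rate
```

Under these assumed numbers, the same flat price pays 25 times more per hour of simple work than per hour of complex work, so a per-item price would have to be renegotiated for every new task type. Pricing by hours avoids that, which is why it felt like the more predictable variable to us.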

When outsourcing data labeling, consider the long term

As you consider outsourcing training data generation for your machine learning work, I would encourage you to think about the long term. At Azavea, we believe the value proposition of machine learning is not to produce short-term gain using incremental automation. Instead, the companies that will reap the most value from adopting machine learning techniques will be those that view ML models as assets that grow more valuable over time, rather than as something that depreciates the moment it’s put into production.

If you structure an ML-related initiative correctly, where model predictions are fed back into the labeling and validation WorkStream you’ve created, then you have an opportunity to constantly recalibrate the models you are using over time and, in theory, compound the effectiveness (and therefore, the value) of both the models you’re using and the underlying data you’re using to train them. If that’s your orientation, then a transient marketplace of part-time annotators who aren’t treated like trusted partners makes no sense. Rather, you should look for a committed, durable relationship that can grow and mature alongside the virtuous loops you’re designing into your ML software.
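For readers who want a more concrete picture of that loop, here is a minimal Python sketch of a single round. It is an illustration under stated assumptions, not a description of our production pipeline: the `Prediction` class, the `feedback_round` function, and the confidence threshold used to decide which predictions go to human review are all hypothetical names and choices introduced for this example. Retraining the model on the updated training set is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Prediction:
    item_id: str       # identifier of the image or tile that was predicted on
    label: str         # label proposed by the model
    confidence: float  # model confidence in [0, 1]


def feedback_round(
    predictions: List[Prediction],
    human_review: Callable[[Prediction], str],
    training_set: Dict[str, str],
    threshold: float = 0.8,  # hypothetical cutoff for sending items to review
) -> Tuple[Dict[str, str], List[Prediction]]:
    """Route low-confidence predictions to human validation.

    Returns an updated copy of the training set (item_id -> validated label)
    and the list of predictions that were confident enough to skip review.
    Only human-validated labels are added to the training set.
    """
    updated = dict(training_set)  # do not mutate the caller's data
    accepted: List[Prediction] = []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            accepted.append(prediction)
        else:
            # The labeling team confirms or corrects the model's proposal.
            updated[prediction.item_id] = human_review(prediction)
    return updated, accepted
```

In use, `human_review` would stand in for the labeling team's validation step, and each round's updated training set would feed the next retraining run. That repeated cycle is where the compounding comes from, and it is why a stable labeling team that knows your data matters.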

Are you interested in partnering with a data labeling team to optimize your machine learning work? Read Niki’s blog on the lessons we learned in doing so. Or drop us a line to talk more about it.
