3 Tips to Optimize Your Machine Learning Project for Data Labeling

One lesson we at Raster Foundry have learned in labeling satellite imagery for machine learning is to involve your software engineer early. Designing labeling projects without substantial discussion beforehand was an early setback: our assumptions often made the work harder than it needed to be. In one case, we asked our data labelers for a level of granularity far greater than the model engineer required.

Our project design template helps us manage the labeling process and encourages communication between all members of the team.

One improvement we implemented was a project description template. Before creating a project, project managers, software engineers, clients, and the data labeling team work together to complete it. The document helps clarify everyone’s expectations and parameters.

Involving the engineering team has (in retrospect) obvious benefits, but consulting with the annotators beforehand is equally important. As the people with the most hands-on experience, their insights are invaluable to the entire machine learning team. I recently asked our CloudFactory (CF) labeling team to share their thoughts on project design. Here's what they had to say.

Tip #1: Use as few classes as possible

When we began creating machine learning projects, we often threw in as many classes as we thought might be useful: "If we can see clouds, why not separate them from 'Background'? Might come in handy one day." We soon learned that this was not the wisest approach, and our CF team concurs.

“…more than two classifications were…difficult to label and also consumed more time and effort”

Prashant Maharjan

Prashant Maharjan let us know that it is "really tough" to label "five or more types" in a given task. Maharjan also points out that such tasks not only take more time but also reduce accuracy. Our data labelers recommend no more than three classes if possible. Add any more, and you'll increase the time needed to label, decrease your label quality, and drain your labelers mentally. Of course, some use cases do need more than three classes, but it's worth weighing that cost before creating your project.

Tip #2: Choose a tool whose features meet your needs

Many on the team brought up the importance of the software application used to label. We relied on CF’s feedback and thoughts as we developed GroundWork, our annotation tool. And, as it turns out, they appreciate our efforts! 

The letters "G" and "W" form part of a hot-air balloon in orange. In the center is the text GroundWork in black.
GroundWork is the first annotation tool for geospatial data.

Our team lead, Smita Shrestha, credits GroundWork’s “continuous improvement and additional features” with increased productivity on the data labeling team. If you are using Sentinel-2 imagery for your machine learning project, one feature our labelers find helpful is the ability to overlay your imagery with false color composites. These composites highlight specific types of features like water or vegetation. Chris Brown, Technical Lead of Raster Foundry, used these overlays to help distinguish various features in Sentinel-2 imagery.  
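If you want to preview what a composite like this looks like before setting up a labeling project, here is a minimal sketch that builds a color-infrared composite (near-infrared, red, green) from Sentinel-2 bands using rasterio and numpy. The band file names are placeholders; point them at the 10 m band rasters from your own scene.

```python
import numpy as np
import rasterio

# Hypothetical paths to Sentinel-2 band rasters (B08 = NIR, B04 = red, B03 = green);
# substitute the files from your own scene.
BAND_PATHS = ["B08.tif", "B04.tif", "B03.tif"]


def read_band(path):
    """Read a single-band raster, returning the pixel array and its profile."""
    with rasterio.open(path) as src:
        return src.read(1).astype("float32"), src.profile


def scale_to_byte(band):
    """Stretch a band to 0-255 using the 2nd-98th percentile for display."""
    lo, hi = np.percentile(band, (2, 98))
    scaled = np.clip((band - lo) / (hi - lo + 1e-6), 0, 1)
    return (scaled * 255).astype("uint8")


bands = []
profile = None
for path in BAND_PATHS:
    band, profile = read_band(path)
    bands.append(scale_to_byte(band))

# Write a 3-band GeoTIFF. Putting NIR in the red channel makes vegetation
# appear bright red, which is what makes the composite useful for labelers.
profile.update(count=3, dtype="uint8")
with rasterio.open("false_color.tif", "w", **profile) as dst:
    for i, band in enumerate(bands, start=1):
        dst.write(band, i)
```

This is just one composite; swapping in other bands (for example, shortwave infrared) highlights water and other feature types in the same way.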

Tip #3: Encourage decision making at the labeler level

As anyone who labels satellite imagery for machine learning soon discovers, the edge cases, questionable identifications, and hazy boundaries are endless. If your data labeling team needs you to answer every question, you'll face near-constant work stoppages. CloudFactory's commitment to developing leadership skills in their workforce allows us to entrust our team to make thoughtful decisions. Devyanee Neupane confirms that her teammates are the first people she teams up with "to find out the proper solution."

Our CF colleagues at a recent labeling team meeting. (Photo: CloudFactory)

Empowering our CF colleagues has always been part of our workflow since they are based in Nepal and we are in Philadelphia. Now that many of us are working from home and keeping distributed hours to juggle child care and increased housework while maintaining some sense of normalcy, the ability to rely on your labelers' decision-making skills is even more critical.

Giving your data meaning 

Even if you are a labeling team of one, it is important to consider your machine learning project from the data labeler's perspective before you create it. After all, your labeling team members are the ones who will give your data meaning. If you are working with professional labelers, drawing on their knowledge and expertise will only improve your data quality. In our case, it also improved our labeling tool and helped us launch what we think is a pretty awesome product. Don't hesitate to reach out to us about how you can work with GroundWork, join our labeling workflow, or both!