The Future of Machine Learning AMA

Last month, we asked readers and our social media followers what questions they had for our machine learning engineers Lewis Fishgold, James McClain, and Rob Emanuele. We received a variety of questions that covered different topics within the world of geospatial machine learning. We covered questions about getting started with machine learning in Part 1. This post will cover the future of machine learning!

We’re machine learning engineers. Ask us anything!


What are the top 3 challenges for geospatial ML pipelines?

(Chinmay @ChinmayIO)

JM: Here are three that I can easily think of:

Obtaining/curating high-quality training data.
Dealing with hardware limitations (e.g. do you have enough video RAM for training batches large enough for your learning algorithm to make progress?)
Hyper-parameter search

None of these three things are unique to the geospatial domain.
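As an illustration of the hyper-parameter search item, here is a minimal random-search sketch. The parameter names, the ranges, and the `toy_score` function are assumptions made up for this example; in practice the scoring function would train a model and return a validation metric.

```python
import math
import random


def sample_config(rng):
    """Draw one hypothetical configuration; the ranges are illustrative only."""
    return {
        # Learning rates are usually sampled on a log scale.
        "learning_rate": 10 ** rng.uniform(-5, -2),
        "batch_size": rng.choice([8, 16, 32]),
        "chip_size": rng.choice([256, 512]),
    }


def random_search(score_fn, n_trials, seed=0):
    """Return the best (config, score) found over n_trials random draws."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_score(config):
    """Stand-in for a real train-and-validate run (an assumption for the demo)."""
    return -abs(math.log10(config["learning_rate"]) + 3)


best_config, best_score = random_search(toy_score, n_trials=50)
```

Random search is only one option. It is shown here because it is simple and easy to run in parallel, not because it is the approach used by the authors.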

RE: For geospatial ML pipelines in particular, a few come to mind:

Utilizing multispectral raster data, which is prevalent for remote sensing applications. This was a big challenge for us, but we’ve now incorporated this into our ML pipelines such that we can train and predict with models that utilize all available spectral bands. We can even take advantage of pre-trained models that only use RGB, which comes out of some great work by our Open Source Fellow Adeel Hassan last year.
The size and complicated nature of geospatial data, which I talk about in my answer to the impediment question below.
Adoption – there are so many use cases that ML can apply to for geospatial data, but many organizations that are processing geospatial data are still doing so on bulky, proprietary desktop software systems. We’re seeing a lot of innovation, with organizations moving their geospatial workflows up to the cloud and leveraging the vast open source ecosystem. Leaders in this space are asking the right questions about how to apply ML to their geospatial data processing pipelines and workflows, and progress is being made, but there is still quite a bit of lag between what can be done and what is being implemented in the industry.
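To make the multispectral point more concrete, here is a sketch of one generic way to reuse first-layer filters that were pre-trained on RGB when the input has more bands. This is not taken from the work mentioned above. The function names and the choice to initialize extra bands with the mean of the RGB kernels are assumptions for illustration.

```python
def mean_kernel(kernels):
    """Element-wise mean of equally sized 2D kernels (lists of lists)."""
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [
        [sum(k[r][c] for k in kernels) / len(kernels) for c in range(cols)]
        for r in range(rows)
    ]


def expand_rgb_filter(rgb_filter, n_bands):
    """Extend one 3-channel filter to n_bands channels.

    The three pre-trained kernels are copied unchanged. Every additional
    band starts from the mean of the RGB kernels, so it begins with
    weights on a similar scale and can be fine-tuned from there.
    """
    if len(rgb_filter) != 3:
        raise ValueError("expected exactly three per-channel kernels")
    if n_bands < 3:
        raise ValueError("n_bands must be at least 3")
    copied = [[row[:] for row in kernel] for kernel in rgb_filter]
    extra = mean_kernel(rgb_filter)
    return copied + [[row[:] for row in extra] for _ in range(n_bands - 3)]


# A tiny 1x2 kernel per channel, expanded from 3 to 5 bands.
rgb_filter = [[[1.0, 0.0]], [[2.0, 3.0]], [[3.0, 6.0]]]
five_band_filter = expand_rgb_filter(rgb_filter, n_bands=5)
```

In a real pipeline the same idea would be applied to the weight tensor of the network's first convolutional layer, and the result would still need fine-tuning on multispectral data.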

LF: I’m not sure that these are the top 3, but here are some challenges to complement the above answers.

Existing labeling tools are not well designed for geospatial use cases. Some features that we need include:
- The ability to split very large images into a set of smaller labeling tasks which can be individually tracked and worked on in parallel
- Importing and exporting geospatial data formats
- Overlays and layer selectors for different basemaps
- Support for viewing time series of imagery
  We have been building our own geospatial labeling tool, which we hope to release soon. Stay tuned!
Sometimes we need to detect objects that vary greatly in size. For example, we did a project where we used drone imagery to detect water towers and rust patches on them. The images are much too large to feed into a neural network all at once, so we resort to chopping them into smaller chips. Although we can detect the rust patches in these narrow glimpses, it can be difficult to tell a patch of sky from a patch of a light blue water tower without additional spatial context. A better approach would be to process the images at multiple resolutions, but this would add complexity and resource consumption to the pipeline.
The field of computer vision has mostly focused on “dogs and cats” datasets such as ImageNet and COCO. A lot of resources have been expended in developing models and open source tools that perform well on these datasets. These tools have been adopted without much modification by the geospatial community, but at the expense of being well optimized for our use case. For instance, off-the-shelf libraries often expect images to be in the form of PNG files, but this limits our ability to train on multispectral images. We are then left with a choice to ignore additional imagery bands and forgo potential accuracy improvements, invest resources into improving these libraries, or write our own libraries from scratch.
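The chipping step described above (splitting an image that is too large for the network into smaller chips) can be sketched as follows. The function name, the window tuple layout, and the overlap strategy are assumptions for illustration, not the behavior of any particular library.

```python
def chip_windows(width, height, chip_size, stride):
    """Yield (col_off, row_off, chip_width, chip_height) windows.

    Windows step by `stride` pixels and the last window in each direction
    is shifted back so it sits flush with the image edge. Every pixel is
    therefore covered and no window extends outside the image.
    """
    if chip_size < 1 or stride < 1:
        raise ValueError("chip_size and stride must be positive")

    def offsets(length):
        if length <= chip_size:
            return [0]
        offs = list(range(0, length - chip_size + 1, stride))
        if offs[-1] != length - chip_size:
            offs.append(length - chip_size)
        return offs

    chip_width = min(chip_size, width)
    chip_height = min(chip_size, height)
    for row_off in offsets(height):
        for col_off in offsets(width):
            yield (col_off, row_off, chip_width, chip_height)


# Windows for a 1000 x 600 pixel image, using 512-pixel chips
# that overlap their neighbors by half a chip.
windows = list(chip_windows(1000, 600, chip_size=512, stride=256))
```

Overlapping chips give objects near a chip boundary a chance to appear whole in at least one chip, at the cost of processing some pixels more than once.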

What aspect of the geo-ML process is the biggest impediment to developing and refining models?

(Simeon Fitch @metasim)

LF: See my answer to the top 3 challenges for geospatial ML pipelines.

JM: One answer that comes immediately to mind for me is “occlusions”.

The objects of interest in our imagery are often obscured, most commonly by clouds. This can sometimes be mitigated by using other imagery of the same place, taken around the same time, to patch things up, but the problem remains.

I am sure that analogous problems occur in many, if not all, real-world applications of machine learning-based computer vision.
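The patching idea from this answer can be sketched as a per-pixel composite over several images of the same place. The sketch assumes the images are already co-registered single-band grids and that a cloud mask is available for each one. Both are assumptions, and producing good cloud masks is a hard problem in its own right.

```python
from statistics import median


def cloud_free_composite(images, cloud_masks, nodata=None):
    """Combine co-registered images, ignoring cloudy pixels.

    images: list of 2D lists of pixel values, all the same shape.
    cloud_masks: list of 2D lists of booleans, True where a pixel is cloudy.
    Each output pixel is the median of the clear observations at that
    location, or `nodata` if every observation there is cloudy.
    """
    rows, cols = len(images[0]), len(images[0][0])
    composite = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            clear = [
                image[r][c]
                for image, mask in zip(images, cloud_masks)
                if not mask[r][c]
            ]
            out_row.append(median(clear) if clear else nodata)
        composite.append(out_row)
    return composite


# Three 1x2 images; the second pixel of the first image is cloudy.
images = [[[10, 200]], [[12, 20]], [[14, 22]]]
masks = [[[False, True]], [[False, False]], [[False, False]]]
composite = cloud_free_composite(images, masks)
```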

RE: I think an example of a big impediment is one that the geospatial community faces a lot: geodata can be big, heavy, complicated data that is difficult to process and put into the forms that ML processes need to do their work. There’s a lack of tooling around creating geospatial-specific training data and ML processes that deal with geospatial inputs and outputs as a first-class data product. That’s why we’re spending so much time focusing on tooling such as Raster Vision and our geospatial imagery annotation tooling (stay tuned for a public release!). By enabling ML engineers to work in the geospatial context, and for geospatial experts to work in the machine learning domain, we’re trying to bridge a lot of those tricky parts that have in the past kept geospatial machine learning a step or two behind the more general ML technology.

How do you define “better training data”? Is higher resolution data better?

(Mark Craddock @mcraddock)

JM: Higher resolution data take more time and resources to manage.

Setting that negative aside, higher-resolution data are potentially better because one always retains the option of reducing the resolution (spatial, temporal, et cetera) if it is not useful, but the reverse is not easy.
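Reducing spatial resolution is simple, as this block-averaging sketch shows. The function name and the requirement that the grid divide evenly by the factor are assumptions for illustration. Going the other way would mean inventing detail that was never captured.

```python
def downsample(raster, factor):
    """Reduce resolution by averaging each factor x factor block of pixels."""
    rows, cols = len(raster), len(raster[0])
    if factor < 1 or rows % factor or cols % factor:
        raise ValueError("raster dimensions must be divisible by factor")
    result = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [
                raster[r + i][c + j]
                for i in range(factor)
                for j in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        result.append(out_row)
    return result


# A 2x4 grid becomes a 1x2 grid when the pixel size is doubled.
coarse = downsample([[1, 3, 5, 7], [1, 3, 5, 7]], factor=2)
```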

Roughly, I would say that “better” training data are more carefully and plentifully labeled. The issue of “plenty” was briefly touched upon in an earlier answer. The issue of “care” is more than just how carefully mudflats have been outlined, but whether the correct kind of mudflats have been outlined, and whether mudflats should have been outlined at all.

RE: It depends on what you want the model to work against. If you are creating a model to do continent-scale predictions using Sentinel 2 imagery, then you’ll need good training data that is at 10m resolution. Now, if you’re building models that can leverage something like drone imagery, e.g. the type of imagery that’s part of the Open Cities AI Challenge, there’s a lot you can expect the model to pick out from that high-resolution imagery that just isn’t available to lower resolution models. Of course, you are trading off for coverage – but that’s another question.

“Better” training data to me means some of the following:

The data is consistent. This means it all has predictable and sensible properties like the Coordinate Reference System, Ground Sample Distance, NoData values, etc. All of the label data is consistent. We’ve put together a number of training datasets as a service (including the Open Cities AI Challenge), and a lot of work goes into making sure the data is of a clean and consistent format so that machine learning engineers don’t have to spend their time tracking down data issues.
The data is cataloged in a clear way that allows ML engineers to be able to take subsets of the data and split the data without having to work too hard. This is where we are big believers in SpatioTemporal Asset Catalogs (STACs), with the label extensions for cataloging geospatial training data. You’ll notice that all the Open Cities AI Challenge data is available in STAC.
The label data is good. That means the data is true, validated ground truth. In our annotation workflows, we always have at least two people on our data team review labels so that we can be sure of the quality.
It is representative of the target domain. Training data that is not diverse enough to represent the distribution of properties found in reality will inevitably contain bias, which it then passes to the model. I can’t train a model on European cities and expect it to work in rural Mexico. The training data needs to be clear about what it is representative of (e.g. train a model on European cities, but know the model should only ever be applied to that same context), or the training data needs to be representative of the broader reality (e.g. include training data of similar visual, ecological and societal properties as the rural Mexican areas you will infer on).
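Once data is cataloged with stable identifiers, splitting it becomes easy to do reproducibly. The sketch below assigns each catalog item to a split by hashing its ID. The item IDs are hypothetical, and the code uses plain strings instead of any particular STAC library.

```python
import hashlib


def assign_split(item_id, val_fraction=0.2):
    """Deterministically assign a catalog item to 'train' or 'validation'.

    The split depends only on the item ID, so it stays stable when the
    catalog grows or when the code runs on another machine.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return "validation" if bucket < val_fraction * 2**64 else "train"


# Hypothetical item IDs, for illustration only.
item_ids = [f"scene-{i:04d}" for i in range(1000)]
splits = {item_id: assign_split(item_id) for item_id in item_ids}
```

For geospatial data, a split by item ID alone may still leak information between neighboring scenes. Grouping by area of interest before hashing is one way to reduce that risk.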

Will we get to a point where we don’t need so much training data? Or is that just something else?

(Azavea’s own Ross Bernet @rosszb)

JM: This is a philosophical question.

If you are a strict materialist, then in principle it should be possible (since humans seem to be able to learn certain types of skills and concepts without much “training data”). Given that, you can ask whence that ability comes: the experiences of a single lifetime (of some given person) or accumulated experience that has been transferred to a person in the form of inherited instinct, physiological proclivity, et cetera. (Which one or ones of those obtain will have implications for the amount and type of resources that might be required to accomplish the goal.)

If you are not a strict materialist then this question is much more wide-open. Also, congratulations on not being a strict materialist.

RE: I think we’re seeing transfer learning play a huge role in reducing the amount of training data needed to create effective models. I’ve seen models pre-trained on SpaceNet 30cm satellite imagery that were fine-tuned on a smaller set of high-resolution drone imagery perform surprisingly well for road and building extraction. Unsupervised learning techniques such as autoencoders or Tile2Vec can help a model learn features from vast amounts of unlabeled data before it is fine-tuned on a supervised task. So there are already mechanisms for getting around a lack of (supervised) training data, and I think they will continue to improve in the future. However, if you want to build a model for a specific task that has a high level of accuracy, the most valuable asset for this is quality training data – I’m not sure what leap in technology will make that not true!

LF: Aside from transfer learning and unsupervised learning, another path to reducing labeling effort is using synthetic training data. The recently published Synthinel dataset contains synthetic aerial imagery of urban scenes along with building labels, generated from a 3D video game engine. Augmenting the original real dataset with the synthetic dataset improved generalization on real imagery. However, simulated data doesn’t contain the same detail and variation as the real world, so there will always be a need for some real-world data.

Have you given careful thought to negative long-term implications and potential abuses of ML approaches to geospatial analysis?

(KT Snicket @KtTemp)

JM: We are aware of the dual-use nature of machine learning. We try to focus ourselves on positive uses, consistent with Azavea’s values.

LF: To add to what James said, we have a set of ethical guidelines for project selection at Azavea, which were described in a blog post by our CEO. Two rules that are relevant here are: “We do not work on weapons systems, warfighting, or activities that will violate privacy, human rights, or civil liberties”, and “We do not work to support the expansion of fossil fuel extraction”.

RE: I can’t characterize the thought I’ve put into this as careful, because I’m not sure anyone is putting careful enough thought into this. In my opinion, in applying AI/ML to geospatial analysis we are still in the phase of asking “what can we do?” An important question that we need to face sooner rather than later is “what should we do?” As a B Corporation, Azavea seeks to apply technology for civic, social and environmental impact. That means we take on projects that generally have a positive impact in the world.

The trick is, even when you aim to have a positive impact, there are still ethical questions around AI that can be tricky. For instance, in a disaster response scenario, if AI is being used to find damaged buildings for directing first responders, what happens if that AI is better able to pick out damage from buildings in higher-cost areas? Are we biasing first response towards populations with higher economic status?
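One way to start probing a question like this is to compare how often a model finds truly damaged buildings in different groups of areas. The sketch below computes recall per group. The group names and the records are made up for illustration, and a real audit would need far more than a single metric.

```python
from collections import defaultdict


def recall_by_group(records):
    """Compute recall of damage detection separately for each group.

    records: iterable of (group, actually_damaged, predicted_damaged).
    Recall is the share of truly damaged buildings that the model flagged.
    Groups with no damaged buildings are omitted from the result.
    """
    hits = defaultdict(int)
    positives = defaultdict(int)
    for group, actually_damaged, predicted_damaged in records:
        if actually_damaged:
            positives[group] += 1
            if predicted_damaged:
                hits[group] += 1
    return {group: hits[group] / positives[group] for group in positives}


# Made-up records: (area group, ground truth, model prediction).
records = [
    ("higher_cost", True, True),
    ("higher_cost", True, True),
    ("higher_cost", True, False),
    ("higher_cost", False, False),
    ("lower_cost", True, True),
    ("lower_cost", True, False),
    ("lower_cost", True, False),
    ("lower_cost", True, False),
]
recalls = recall_by_group(records)
```

A large gap between groups would be a signal to look more closely at the training data and imagery for the under-served areas before relying on the model's output.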

The nature of applying AI to decision making when the stakes are that high is something we as a society have to be very careful about. These questions do come up for us, and we try to think through ways that our ML might be misused, either accidentally or with ill intent, and do our best to mitigate that risk. However, as I said, there’s always more to be done.

Currently, Azavea is participating in a Responsible AI for Disaster Risk Management Working Group that combines technology and disaster relief practitioners to discuss the ethical issues on how AI can and will be applied in these scenarios. Our colleague Niki Lagrone has written a series of thought-provoking blog posts around this subject. As we continue to build out the technologies that help us apply ML to geospatial analysis, we will also continue to listen, learn and raise our voice in the conversation of how to ethically apply these technologies to maximize their benefit and try our best to mitigate the risk that they present.

Thank you to everyone who submitted questions! Did you enjoy this Ask Me Anything blog? Want to see more of them? Submit your ideas and questions here or on Twitter.
