Detection and removal of clouds in satellite imagery is a natural application of machine learning, but freely available training datasets have been in short supply. To address this, Azavea has produced, and hereby announces the immediate public release of, a dataset consisting of 32 unique Sentinel-2 tiles with cloud labels produced by humans. Each of the 32 tiles is present in L1C (top of atmosphere) and L2A (surface reflectance) versions. The tiles cover 25 unique locations, and all four seasons are represented in the dataset. As we discussed in an earlier blog post, we have been able to use these data to produce some interesting results, but now it is your turn!
The data can be found in a publicly accessible, requester pays S3 bucket:
There are three files associated with every tile in the dataset: a catalog.zip file (a STAC archive containing the vector labels), a *L1C-0.tif file (a GeoTiff file containing an L1C version of the tile), and a *L2A-0.tif file (a GeoTiff file containing an L2A version of the tile). The list of tiles in the dataset can be found on GitHub. That file contains the locations of the catalog.zip files and the *L1C-0.tif files; the locations of the L2A files are deducible from the locations of the L1C files.
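As an illustration of how an L2A location might be deduced from an L1C location, here is a minimal sketch. It assumes that the two locations differ only in the filename suffix (L1C-0.tif versus L2A-0.tif), which is a guess based on the file naming described above and should be checked against the actual bucket contents. The function name and the example URI are hypothetical.

```python
L1C_SUFFIX = "L1C-0.tif"
L2A_SUFFIX = "L2A-0.tif"


def l2a_location(l1c_location: str) -> str:
    """Guess the L2A location that corresponds to an L1C location.

    Assumption: only the filename suffix differs between the two versions.
    """
    if not l1c_location.endswith(L1C_SUFFIX):
        raise ValueError(f"not an L1C location: {l1c_location}")
    # Swap the trailing suffix and leave the rest of the location untouched.
    return l1c_location[: -len(L1C_SUFFIX)] + L2A_SUFFIX


# Hypothetical URI, for illustration only.
example = l2a_location("s3://example-bucket/some-tile/L1C-0.tif")
```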
The total size of the dataset is about 80GB.
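Because the bucket is requester pays, each download request has to state that the requester accepts the transfer charges. The sketch below shows one way to do that with a boto3 S3 client, which is passed in as an argument. The function name is hypothetical, and the bucket, key, and destination are placeholders that you would fill in from the list of tiles.

```python
def download_requester_pays(s3_client, bucket, key, destination):
    """Download one object from a requester-pays bucket.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    The RequestPayer argument acknowledges that the caller's AWS account
    is billed for the request and the data transfer.
    """
    s3_client.download_file(
        Bucket=bucket,
        Key=key,
        Filename=destination,
        ExtraArgs={"RequestPayer": "requester"},
    )
```

With boto3 installed and AWS credentials configured, this could be called as download_requester_pays(boto3.client("s3"), bucket, key, "tile.tif"). Given the total size of the dataset, it may be worth fetching a single tile first.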
The GeoTiff format is venerable and well-known so it does not require further discussion, but the STAC format used for the vector labels might benefit from a few re-introductory words.
The STAC specification is an effort to produce a lingua franca for geospatial data to facilitate interoperability, indexability, discoverability, &c. The STAC format has been much written about on the Azavea blog (for example here, here, here, and here), and the curious reader is encouraged to read those posts for additional background. STAC archives can contain pointers to raster data, vector data, and many other things.
The STAC archives that are provided in this dataset do not include the location of the underlying Sentinel-2 imagery; instead, the locations of those files must be taken from the catalogs.json file provided on GitHub (as previously mentioned).
In this dataset, the GeoJSON vector labels for each Sentinel-2 tile are stored in the associated STAC. The vector label information can be found within each tree-like STAC archive as follows.
The sole child of the root of each STAC is a Layer Collection, and that Layer Collection contains an Image Collection and a Label Collection. The Image Collection contains a single item that can be safely ignored because the location of the imagery is given separately, as previously mentioned. The Label Collection also contains a single Item, and that item contains the GeoJSON label data.
The reader unfamiliar with STAC is encouraged to use PySTAC in a Python REPL to explore the contents of one or more of the STACs in this dataset. A fully-working example of how to parse these STACs using PySTAC can be found on GitHub.
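For readers who would like to see the tree structure without installing anything, the following is a minimal sketch that walks an unzipped STAC archive using only the Python standard library. It is not the PySTAC example mentioned above. It assumes that the archive has been extracted to a local directory, that the root file is named catalog.json, and that links use relative hrefs; none of these details are confirmed here, and the function name is hypothetical.

```python
import json
from pathlib import Path


def walk_stac(catalog_path):
    """Yield (path, document) for every STAC Item reachable from catalog_path.

    Catalogs and Collections are followed through their "child" and "item"
    links. STAC Items are GeoJSON Features, so they are recognized by type.
    Assumption: every href is a path relative to the file that contains it.
    """
    catalog_path = Path(catalog_path)
    with open(catalog_path) as f:
        doc = json.load(f)

    if doc.get("type") == "Feature":
        yield catalog_path, doc
        return

    for link in doc.get("links", []):
        # Ignore "self", "root", and "parent" links to avoid walking in circles.
        if link.get("rel") in ("child", "item"):
            target = (catalog_path.parent / link["href"]).resolve()
            yield from walk_stac(target)


def print_item_assets(catalog_path):
    """Print each Item's id and the hrefs of its assets."""
    for path, item in walk_stac(catalog_path):
        hrefs = [asset.get("href") for asset in item.get("assets", {}).values()]
        print(path, item.get("id"), hrefs)
```

Calling print_item_assets on the root file of an extracted archive should list the Items in both the Image Collection and the Label Collection; the asset of the label Item is where the GeoJSON label data would be expected.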
You can begin to experiment with the dataset immediately by cloning the Azavea Cloud Model GitHub repository.
It is assumed that the reader has an AWS account and that Docker is properly configured and working locally.
To set up the necessary AWS Batch resources, the instructions given in the Raster Vision documentation can be used as a starting point. After the resources have been created, the amount of storage available to GPU instances must be increased so that the dataset fits within the local storage of a GPU instance. This can be accomplished by adding a launch template to the GPU compute environment that is used to generate GPU instances on Batch. Instructions for how to increase the storage size can be found here. We recommend increasing the local storage to 512GB.
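To give a rough idea of what such a launch template contains, the sketch below builds the launch template data for a larger EBS volume as a plain Python dictionary. The device name and volume type are assumptions that depend on the AMI used by your compute environment, and the function name is hypothetical; the linked instructions remain the authoritative reference.

```python
def storage_launch_template_data(size_gb=512, device_name="/dev/xvda"):
    """Build EC2 launch template data that sets the size of one EBS volume.

    Assumptions: device_name matches the volume that holds local storage
    on the AMI in use, and a gp2 volume is acceptable.
    """
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": device_name,
                "Ebs": {
                    "VolumeSize": size_gb,  # size in GiB
                    "VolumeType": "gp2",
                    "DeleteOnTermination": True,
                },
            }
        ]
    }


template_data = storage_launch_template_data()
```

A dictionary of this shape can be passed as the LaunchTemplateData argument of the boto3 EC2 client's create_launch_template call, and the resulting template can then be attached to the GPU compute environment.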
After that, it should be possible to follow the instructions given in the README.md file in the repository.
Given the rasters and STACs in the dataset, you can modify our code or write/use your own!