LocationTech GeoTrellis is a library that enables low latency, distributed processing of geospatial data, particularly for imagery and other raster data. It depends on Apache Spark to leverage the power of distributed computation.
There has been significant progress since the 1.1 release in June. The release in September of the feature-rich Python binding project called GeoPySpark marks an important milestone: increasing the developer user base that can utilize GeoTrellis. Additionally, we have seen new contributors, new features, and improved documentation. We are excited by the expanding community and look forward to welcoming more users.
Python Bindings: GeoPySpark
Although GeoPySpark has its own release schedule and recently released version 0.3.0, the two projects are closely linked. We did not anticipate the two-way benefits of creating GeoPySpark at the beginning of the project. GeoPySpark provides access to a subset of core GeoTrellis features. In the process of making the GeoPySpark API as succinct as possible, we discovered approaches that would make the core GeoTrellis API more direct. Two examples: (1) the current GeoTrellis `Rasterizer` was developed in GeoPySpark before being brought into GeoTrellis and (2) the GeoTrellis tiling and reproject API/logic is being incorporated into the new GeoTrellis approach.
In addition to expanding the user base through Python support, GeoPySpark was designed to facilitate iterative workflows in a Jupyter Notebook. GeoPySpark can be installed using `pip` or accessed through a Docker container that includes all of the necessary dependencies. Users can develop workflows and iterate on algorithmic design on a single machine in a Jupyter Notebook environment, then scale the workflow to a national or global dataset by leveraging cluster computing. Azavea worked with the team at Kitware that created GeoNotebook, a tool that provides an embedded map inside of a Jupyter Notebook, to develop interactive workflows.
I am highlighting a few of the major new features. Find the complete list of API changes, new features, and bug fixes in the changelog.
Rasterizing Geometry Layers
Finally, the full marriage of the vector, raster, and spark packages! You can now transform an RDD[Geometry] into a writable GeoTrellis layer of (SpatialKey, Tile)!
val geoms: RDD[Geometry] = ...
val celltype: CellType = ...
val layout: LayoutDefinition = ...
val value: Double = ... /* Value to fill the intersecting pixels with */
val layer: RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] =
  geoms.rasterize(value, celltype, layout)
Rasterization is the process of converting vector data to raster data. Our first use case was rasterizing roads to build a friction surface, which in turn supports a cost distance function for estimating travelsheds.
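To make the idea concrete, here is a minimal sketch of rasterization in plain Scala with no GeoTrellis or Spark dependency. It is not the GeoTrellis `Rasterizer`; the `RasterizeSketch` object, the `Ring` type, and every parameter name are hypothetical, and the approach (burn a value into each cell whose center falls inside a polygon) is just one simple rasterization rule.

```scala
object RasterizeSketch {
  /** A polygon as a ring of (x, y) vertices; a hypothetical stand-in for a real Geometry type. */
  type Ring = Vector[(Double, Double)]

  /** Even-odd (ray casting) point-in-polygon test. */
  def contains(ring: Ring, x: Double, y: Double): Boolean = {
    // Pair each vertex with the next one, wrapping around to close the ring.
    val edges = ring.zip(ring.tail :+ ring.head)
    edges.count { case ((x1, y1), (x2, y2)) =>
      // Does a horizontal ray from (x, y) toward +x cross this edge?
      (y1 > y) != (y2 > y) && x < x1 + (y - y1) / (y2 - y1) * (x2 - x1)
    } % 2 == 1
  }

  /** Burn `value` into every cell whose center lies inside the ring.
    * The grid's top-left corner is (xmin, ymax) and row 0 is the top row.
    */
  def rasterize(ring: Ring, value: Double, noData: Double,
                xmin: Double, ymax: Double, cellSize: Double,
                cols: Int, rows: Int): Vector[Vector[Double]] =
    Vector.tabulate(rows, cols) { (row, col) =>
      val cx = xmin + (col + 0.5) * cellSize
      val cy = ymax - (row + 0.5) * cellSize
      if (contains(ring, cx, cy)) value else noData
    }
}
```

Rasterizing a 2x2 square onto a 4x4 grid of unit cells with this sketch fills only the four central cells. The distributed version in GeoTrellis additionally has to decide which tiles each geometry touches, which this single-grid sketch leaves out.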
Clipping Geometry Layers to a Grid
In a similar vein to the above, you can now transform an arbitrarily large collection of Geometries into a proper GeoTrellis layer, where the sections of each Geometry are clipped to fit inside their enclosing Extents.
Here we can see a large Line being clipped into nine sublines. It’s one method call:
import geotrellis.spark._
val layout: LayoutDefinition = ... /* The definition of your grid */
val geoms: RDD[Geometry] = ... /* Result of some previous work */
/* There are likely many clipped Geometries per SpatialKey... */
val layer: RDD[(SpatialKey, Geometry)] = geoms.clipToGrid(layout)
/* ... so we can group them! */
val grouped: RDD[(SpatialKey, Iterable[Geometry])] = layer.groupByKey
If clipping on the Extent boundaries is not what you want, there are ways to customize this. See the ClipToGrid entry in our Scaladocs.
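For readers curious what clipping to a grid involves, the following is a rough sketch in plain Scala, not the GeoTrellis implementation. It clips a single straight segment against every cell of a small grid using the Liang-Barsky algorithm. The `ClipToGridSketch` object and its `Point`, `Segment`, and `Key` types are hypothetical stand-ins, and for simplicity the grid origin is (0, 0) with rows counted upward, which differs from the top-down row order GeoTrellis uses for `SpatialKey`.

```scala
object ClipToGridSketch {
  final case class Point(x: Double, y: Double)
  final case class Segment(a: Point, b: Point)
  /** Grid cell address; a hypothetical stand-in for SpatialKey. */
  final case class Key(col: Int, row: Int)

  /** Liang-Barsky clip of a segment against the box [xmin, xmax] x [ymin, ymax]. */
  def clip(s: Segment, xmin: Double, ymin: Double,
           xmax: Double, ymax: Double): Option[Segment] = {
    val dx = s.b.x - s.a.x
    val dy = s.b.y - s.a.y
    // Each (p, q) pair encodes one box edge as the constraint p * t <= q,
    // where t in [0, 1] parameterizes the segment from a to b.
    val constraints = Seq(
      (-dx, s.a.x - xmin), (dx, xmax - s.a.x),
      (-dy, s.a.y - ymin), (dy, ymax - s.a.y))
    val range = constraints.foldLeft(Option((0.0, 1.0))) {
      case (Some((t0, t1)), (p, q)) =>
        if (p == 0.0) { if (q < 0.0) None else Some((t0, t1)) } // parallel to this edge
        else if (p < 0.0) Some((math.max(t0, q / p), t1))       // entering the box
        else Some((t0, math.min(t1, q / p)))                    // leaving the box
      case (None, _) => None
    }
    // Keep only pieces with positive length; a single touching point is dropped.
    range.collect { case (t0, t1) if t0 < t1 =>
      Segment(Point(s.a.x + t0 * dx, s.a.y + t0 * dy),
              Point(s.a.x + t1 * dx, s.a.y + t1 * dy))
    }
  }

  /** Clip one segment to every cell of a cols x rows grid of square cells. */
  def clipToGrid(s: Segment, cellSize: Double, cols: Int, rows: Int): Seq[(Key, Segment)] =
    for {
      col   <- 0 until cols
      row   <- 0 until rows
      piece <- clip(s, col * cellSize, row * cellSize,
                    (col + 1) * cellSize, (row + 1) * cellSize).toList
    } yield (Key(col, row), piece)
}
```

A horizontal segment that spans three cells comes back as three keyed pieces, one per cell, which mirrors the (SpatialKey, Geometry) pairs described above. Testing every cell is wasteful for large grids; a real implementation would first narrow the candidates to the cells under the geometry's bounding box.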
Sparkified Viewshed and Euclidean Distance
Viewshed for tiles was introduced in GeoTrellis 0.10.0 and Euclidean distance became available in GeoTrellis 1.0. Prior to GeoTrellis 1.2, both were possible at the individual Tile level but not at the Layer (RDD) level. GeoTrellis 1.2 brings both to the distributed Spark environment, so the operations can now run on arbitrarily large datasets.
An example code snippet for Euclidean Distance:
/* Result of previous work. Potentially millions of points per SpatialKey. */
val points: RDD[(SpatialKey, Array[Coordinate])] = ...
val layout: LayoutDefinition = ... /* The definition of your grid */
val layer: RDD[(SpatialKey, Tile)] = points.euclideanDistance(layout)
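As a rough illustration of what a Euclidean distance tile contains, the sketch below computes, for each cell center, the distance to the nearest input point. It is a brute-force version in plain Scala written only to show the output; it makes no claim about how GeoTrellis computes the result, and the `EuclideanDistanceSketch` object and its parameters are hypothetical.

```scala
object EuclideanDistanceSketch {
  /** For each cell center, the distance to the nearest input point.
    * The grid's top-left corner is (xmin, ymax) and row 0 is the top row.
    * Cells are NaN when there are no points at all.
    */
  def distanceTile(points: Seq[(Double, Double)],
                   xmin: Double, ymax: Double, cellSize: Double,
                   cols: Int, rows: Int): Vector[Vector[Double]] =
    Vector.tabulate(rows, cols) { (row, col) =>
      val cx = xmin + (col + 0.5) * cellSize
      val cy = ymax - (row + 0.5) * cellSize
      if (points.isEmpty) Double.NaN
      else points.map { case (px, py) => math.hypot(px - cx, py - cy) }.min
    }
}
```

This costs one distance computation per cell per point, so it is only practical for small inputs. The layer-level operation also has to account for points in neighboring tiles, since the nearest point to a cell may sit just across a tile boundary.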
Projects Using GeoTrellis
Here are two highlighted projects that use GeoTrellis:
RasterFrames has been a fruitful collaboration that has led to good feedback for GeoTrellis in addition to many useful contributions to the core library. Developed by a company called Astraea, RasterFrames brings the power of Spark DataFrames to geospatial raster data. It leverages GeoTrellis' map algebra and tile layer operations. This is definitely an exciting project to look into, particularly for data scientists with knowledge of data frames who want to explore large geospatial datasets.
Raster Foundry is a web application being developed by Azavea that is built on top of GeoTrellis functionality. It focuses on providing an intuitive user interface and easy-to-follow workflows. It gives both trained geospatial professionals and non-technical users access to advanced geospatial workflows and analytics on large datasets.
With 1.2 released, we are focusing efforts on the 2.0 release that we anticipate for the first half of 2018. We believe that GeoTrellis contains much of the core functionality we envisioned for the project. This does not mean it is done or feature complete, and much work remains to make GeoTrellis more usable and accessible. GeoPySpark is a good start for reaching Python developers, but we still need to continue improving the documentation, tutorials, and demonstrations for both core GeoTrellis and GeoPySpark. There are also several features and optimizations that are on the roadmap for the coming year:
- Cloud Optimized GeoTiff (COG) is a format for the internal organization of GeoTiff files that enables efficient retrieval of data subsets in cloud workflows. We plan to make COGs the standard format for storing GeoTrellis layers. Development has begun; read about the plans here
- Bringing in additional features from GeoTrellis into GeoPySpark
- Map Algebra Modeling Language (MAML) is a declarative structure that describes a sequence of map algebra operations. This structure can be evaluated against a given collection of datasets to compute a result. Critically, the evaluation logic is not specified in MAML, only the semantic meaning of the operations. This separation allows multiple interpreters to exist that operate in different computational contexts. This has the potential to expose GeoTrellis processing power to custom interpreters
- Improve GeoTrellis job performance and reduce resource requirements by optimizing data access patterns based on query and job structure
- Improve query performance for large spatiotemporal layers by implementing more advanced space filling curve indexing techniques, based on the approach used in GeoWave
- Expose support for the OGC WCS standard
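The separation that the MAML item describes, between what an expression means and how it is evaluated, can be sketched in a few lines of plain Scala. This is only an illustration of the idea: the `MamlSketch` object and its node types (`Source`, `Add`, `Multiply`) are hypothetical and do not reflect MAML's actual schema or API.

```scala
object MamlSketch {
  /** A tiny declarative expression tree. It says what to compute, not how. */
  sealed trait Expr
  final case class Source(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Multiply(left: Expr, right: Expr) extends Expr

  /** Interpreter 1: evaluate against one value per named source. */
  def evalScalar(e: Expr, data: Map[String, Double]): Double = e match {
    case Source(n)      => data(n)
    case Add(l, r)      => evalScalar(l, data) + evalScalar(r, data)
    case Multiply(l, r) => evalScalar(l, data) * evalScalar(r, data)
  }

  /** Interpreter 2: evaluate cell by cell against flat rasters of equal length. */
  def evalRaster(e: Expr, data: Map[String, Vector[Double]]): Vector[Double] = e match {
    case Source(n) => data(n)
    case Add(l, r) =>
      evalRaster(l, data).zip(evalRaster(r, data)).map { case (a, b) => a + b }
    case Multiply(l, r) =>
      evalRaster(l, data).zip(evalRaster(r, data)).map { case (a, b) => a * b }
  }
}
```

The same expression value can be handed to either interpreter unchanged. A third interpreter that targets Spark layers could be added without touching the expression types, which is the kind of flexibility the roadmap item points to.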
- Gitter (The fastest way to get an answer about GeoTrellis)
- Twitter (Track updates and share what you are working on!)
- GeoTrellis on GitHub
- GeoTrellis documentation
- GeoPySpark on GitHub
- Raster Foundry website
- Raster Foundry on GitHub