General Purpose Computing on the Graphics Processing Unit, or GPGPU, is a technology that enables software to use the computing power of the multiple processing units that come with modern graphics cards. The benefit of using these processing units is that computational work gets done by many workers at once, instead of by one CPU or a few threads on a multi-core CPU.
At first, it may sound like multi-threading, and in a way it is. What’s key, though, is that the GPU doesn’t just multi-thread — it’s synchronized, lock-stepped multi-threading. It’s called Single Instruction, Multiple Data, or SIMD in Flynn’s Taxonomy (NVidia’s hardware uses a variant called Single Instruction, Multiple Threads, or SIMT). This refers to the control flow in each thread on the GPU — to maximize performance, it’s important to get all threads running through the same logic at the same time. In practice, some threads will eventually branch, and keeping the branching to a minimum is critical.
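To make the branching point concrete, here’s a minimal CPU sketch, using NumPy as a stand-in for GPU threads. It mimics how SIMD/SIMT hardware typically handles a divergent branch: both sides get computed for every element, and a per-element mask selects the result — so a diverged warp pays for both paths. The values here are made up for illustration.

```python
import numpy as np

# One "thread" per array element. Divergent logic:
#   if (v > 0) v * 2  else  v / 2
values = np.array([1.0, -2.0, 3.0, -4.0])

mask = values > 0
# np.where evaluates BOTH branch expressions for every element,
# then selects per element -- analogous to predicated execution
# on lock-stepped SIMD lanes.
result = np.where(mask, values * 2.0, values / 2.0)
```

The takeaway: on lock-stepped hardware, a branch doesn’t skip work — it masks it, which is why minimizing divergence matters.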
When we have processing tasks that can be broken into many independent pieces, the SIMT nature of the GPU enables computations to really fly when using GPGPU. Structuring computations in this fashion will result in operations that are highly parallelized. As a bonus, they will also be easily distributed to other computation engines. But unlike other distributed computing methods, GPGPU is tightly coupled and has very low latency — you get immediate results.
To facilitate all of the above, GPGPU requires a video card that supports programmability. While the major GPU manufacturers have released proprietary GPGPU languages — for example, NVidia’s CUDA framework — a cross-platform, device-independent alternative called OpenCL has recently become sufficiently robust that we can begin to use it for implementing GPGPU processing. Originally developed by Apple, OpenCL is now overseen by a non-profit organization, the Khronos Group.
If you’ve got a GPU on hand, it’s quite likely that there is an OpenCL API available in your programming language of choice. Here’s a short list from the Khronos Group’s website:
By the way, Khronos just released the spec for OpenCL 1.1 on June 14th, 2010.
GPGPU for GIS
Our interest in GPGPU is motivated by our desire to speed up complicated raster math calculations, in particular those that would enable us to make more sophisticated geoprocessing available on web and mobile platforms. Our vision is to achieve an order-of-magnitude or better improvement in geospatial data processing and thereby make web-based analysis tools more responsive and compelling. To support our research we applied for and were awarded a National Science Foundation (NSF) Small Business Innovation Research (SBIR) grant to benchmark some traditional GIS operations against GPU-accelerated versions of the same operations. Our objective is to speed up these operations by about 20x, and it looks promising.
We’re sampling the wide array of Map Algebra operations that are in standard GIS toolkits, focusing on a couple of examples of each of the following types of operations: local, focal, zonal, and global. Some of these types of operations, such as local and focal operations, are easily parallelized. The others require a fair amount of algorithm wrangling before they show any improvement on the GPU’s parallel architecture.
Our research is ongoing, so it’s still a bit too early to tell, but we’re excited to already see some impressive performance improvements.
Migrating our chosen Map Algebra operations to the GPU was trivial in some cases, but quite challenging in others. For example, local Map Algebra operations are quite easy to parallelize: each cell in one raster is combined with the cell at the same location in another raster. This requires very little extra memory and few intermediate steps.
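A minimal sketch of a local operation, again with NumPy standing in for the GPU (the small rasters here are hypothetical). On the GPU, each output cell would map to one thread doing the same arithmetic, which is why local operations parallelize so cleanly.

```python
import numpy as np

# Two hypothetical 2x3 input rasters.
raster_a = np.array([[1, 2, 3],
                     [4, 5, 6]], dtype=np.float32)
raster_b = np.array([[10, 20, 30],
                     [40, 50, 60]], dtype=np.float32)

# Local operation: each output cell depends ONLY on the cells at the
# same location in the inputs, so every cell can be computed in
# parallel with no coordination between threads.
local_sum = raster_a + raster_b
```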
Focal (or “neighborhood”) operations are not simple, particularly for large neighborhoods, but with some intelligent partitioning, it’s relatively easy to work with different sections of the rasters. This requires some intermediate buffers, but performs well using local memory on the GPU.
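Here’s a sketch of a focal mean with a 3×3 window, skipping edge cells for brevity (a real implementation would pick an edge policy). The raster is made up; on the GPU, each tile of the raster plus a one-cell halo is what would be staged into local memory.

```python
import numpy as np

# Hypothetical 5x5 raster: values 0..24.
raster = np.arange(25, dtype=np.float32).reshape(5, 5)

focal_mean = np.zeros_like(raster)
for r in range(1, 4):          # interior rows only (edges skipped)
    for c in range(1, 4):      # interior columns only
        # Each output cell averages its 3x3 neighborhood -- the
        # "intermediate buffer" is just the window of input cells.
        window = raster[r - 1:r + 2, c - 1:c + 2]
        focal_mean[r, c] = window.mean()
```

Note that neighboring output cells reuse most of the same input cells — that overlap is exactly what makes staging tiles in the GPU’s local memory pay off.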
Zonal operations summarize the cells in one raster data set using zones identified in a second one. These operations require a couple of scans across a raster and a few intermediate stages — that one was complicated.
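A sketch of a zonal mean on hypothetical data: a value raster summarized by the zone IDs in a second raster. Here `np.bincount` does in one call the scan-and-reduce that takes multiple kernel passes on the GPU.

```python
import numpy as np

# Hypothetical value raster and a zone raster with two zones, 0 and 1.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
zones = np.array([[0, 0, 1],
                  [1, 1, 0]])

# Sum the values falling in each zone, count the cells per zone,
# then divide -- the classic scan + reduce-by-key pattern.
sums = np.bincount(zones.ravel(), weights=values.ravel())
counts = np.bincount(zones.ravel())
zonal_mean = sums / counts
```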
What’s crazy difficult, though, are the global operations. These operations are challenging because they operate on the entire dataset, with one cell on one side of the raster potentially affecting a cell on the other side of the raster. Examples of global operations include Euclidean distance, cost-weighted distance, viewshed analysis and others. These operations are non-trivial to convert to GPU processing in a performant way, but it is definitely possible.
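To see why global operations are hard, here’s a deliberately brute-force sketch of one of them — Euclidean distance from every cell to the nearest “source” cell, with made-up source locations. Every output cell can depend on a source anywhere in the raster, which is what defeats the simple tiling that works for focal operations; the O(cells × sources) loop below is fine for a demo but far too slow for real rasters.

```python
import numpy as np

# Hypothetical source cells (row, col) at opposite corners.
sources = np.array([[0, 0], [4, 4]])
rows, cols = 5, 5

dist = np.full((rows, cols), np.inf)
for r in range(rows):
    for c in range(cols):
        for sr, sc in sources:
            # Each output cell must consider EVERY source, no matter
            # how far away -- the "global" dependency in a nutshell.
            d = np.hypot(r - sr, c - sc)
            dist[r, c] = min(dist[r, c], d)
```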