Azavea Labs

Where software engineering meets GIS.

GPU Occupancy and Idling

As our ongoing research into raster processing for GIS on the GPU progresses, we have gone through various stages in the development of each Map Algebra operation.  Having converted a given operation to the GPU, we are finding that there are many potential ways to optimize, and this optimization process brings with it a host of issues that highlight the differences between sequential CPU programming and GPGPU parallel programming.

During the optimization process, we’ve found (and been told) that the single most important optimization is to ensure memory coalescence.  I blogged about that before, so if you haven’t seen it yet, it might be worth reading before you continue on.

After maximum memory coalescence has been achieved, it is possible to focus on 2 additional metrics: occupancy and idling.


The occupancy metric is defined as the number of active thread groups per processor divided by the maximum number of thread groups per processor.  It’s a value in the range of 0-100%.

Occupancy is the number of thread groups (NVidia calls them ‘warps’, ATI calls them ‘wavefronts’) that are active at one time.  At any one time, some thread groups may be processing data, and some thread groups may be accessing global memory.  When some thread groups are accessing global memory, these threads are effectively stalled for hundreds of instructions, while the other thread groups continue on.

Internally, the GPU has a thread group scheduler which controls when thread groups are executed. This is extremely useful, since highly parallel operations will utilize many thread groups to perform calculations. The GPU is highly parallel, but even it has its limits. This is where the thread group scheduler comes in — it can execute some of the thread groups, while other thread groups are idle, either completed or queued. This scheduling enables some thread groups to perform memory access, while other thread groups perform calculations.

Understanding the scheduler makes it possible to ‘hide’ these global memory accesses by performing ~100 arithmetic instructions between each global memory access.  Hypothetically, if the GPU executed a kernel that accessed global memory, performed a heavy-duty calculation, then saved that result, the occupancy would probably be pretty high. The thread group scheduler would schedule a set of thread groups for accessing global memory while scheduling another set of thread groups for heavy-duty calculation. This is effectively ‘hiding’ the memory access, since the GPU can perform computation instructions while accessing memory. Interestingly, there will be a point when increases to occupancy won’t improve your performance. It is at this point when all global memory accesses are ‘hidden’ by the computation, and it becomes time to look other places for optimization.


The idling metric is defined as the amount of time the GPU is idle divided by the overall execution time of the computation.  It’s a value in the range of 0-100%.

Idling is something that we have discovered to be critical to the performance of a calculation.  The reference and training documentation instructs GPGPU developers to keep the GPU as busy as possible for as long as possible, and stops there.  By creating this metric, we were able to measure just how much this idling was affecting our computation.

As it turns out, our initial experiments showed that our GPU was idle during periods of memory transfer to and from the CPU.  This idling of the GPU was extending the overall time for computation.  Minimizing this idling through asynchronous kernel execution and memory transfer resulted in a significant and immediate performance improvement.

Coalescence, Occupancy, Idling

To summarize, the best way to optimize your GPU computations is to investigate and optimize these three steps (and in this order):

  1. Memory coalescence
  2. Thread group occupancy
  3. GPU Idling

There are a number of smaller optimization that can be done as well, but we’ve found these to be the big 3.  Of course, you can continue this process forever, and demonstrate to your boss the law of diminishing returns.

GPUs and Parallel Computing Architectures

I’ve been blogging about GPUs recently, and I think you can tell it’s because I’m excited about the technology.  General Purpose Computing on the GPU (GPGPU) promises great performance increases in computationally heavy software, which we find immensely useful.  In the past, we’ve managed to engineer web-based applications (see: SmartConservation) that could run complex models by implementing a process queuing architecture, but in these systems, while they will run on the web, processing may still take several minutes and they therefore can neither provide a responsive user experience nor support many users.  We’ve also engineered a system that can perform fast, distributed raster calculations (see: Walkshed, powered by DecisionTree).

One of the reasons that GPGPU is so promising is the increasing number of processing cores available on affordable graphics cards.  This increases the computation capacity by leveraging many processors running in parallel. What’s interesting is that this technique is not new.  Timothy Mattson, blogging at Intel, has been doing this since the mid 80′s.  The Library of Congress contains a book on parallel computing structures and algorithms, dating back to 1969.

As we delve deeper into our work improving Map Algebra operations, important differences in algorithm approaches and implementations become apparent: not all parallel architectures are the same.  One might be tempted to think that when switching from the single-threaded CPU logic to multithreaded/parallel logic that there would be one model of parallel computing that is universal.  This is definitely not the case.

Three of the most popular types of parallel computing today are:

  • Shared-memory Multi-Processors (SMP)
  • Distributed-memory Massive Parallel Processors (MPP)
  • Cluster computing

Each type of parallel computing has its benefits and drawbacks.  It really just depends what kind of computing you need to do. I’ll describe these common computing types in detail, starting with the ‘traditional’ CPU model.


CUDA, Stream, and OpenCL

Computing on the GPU, or GPGPU, is a steadily maturing technology.  There are many technologies out in the wild that will enable you to use GPU’s for computation, but there’s a catch: the vendors are still vying for the lead.  The two market leaders are currently NVidia and AMD/ATI.

That means that NVidia is pushing their GPGPU API, which is named Compute Unified Device Architecture,” or CUDA.  Their rival, AMD/ATI, is pushing Stream. Stream incorporates BrookGPU, a compiler and data-parallel language developed at Stanford University, which predates CUDA.

NVidia & CUDA, ATI & Stream, or OpenCL

NVidia & CUDA, ATI & Stream, or OpenCL

Both of these vendor APIs are proprietary, and run on each vendor’s specific hardware.  This makes sense if a developer can control what hardware computations will be using. Realistically, a developer rarely has such control. So what are the options? At the current time, there are only a couple:  OpenCL and Microsoft’s DirectCompute technology.  Microsoft’s technology is limited to Windows Vista and Windows 7, though, so we are focusing on OpenCL.

OpenCL is the Open Computing Language, a language that extends the C99 standard (a modern dialect of the C programming language) and compiles into device-specific binaries. OpenCL was originally developed by Apple, and handed over to the Khronos Group. The OpenCL standard was ratified by the consortium in December of 2008.  The Khronos Group consortium includes all the major players in the field, including NVidia, AMD/ATI, Apple, and Intel.  The list is much more extensive, but those are the four to be happy about.  Intel doesn’t support OpenCL in their multicore CPUs, but I’m optimistic that they will release an OpenCL API to leverage CPU cores as well as GPU cores as computing devices.

OpenCL was created to address the need for speed in current desktop systems that contain GPU processors.  The language was created to address computing on heterogeneous systems, which, when you think about it, can include many other types of computing devices.  If OpenCL is adopted by Android, then you could optimize code to run on Android devices, too. While this may not be the fastest approach, it would potentially let you distribute work among devices.

One caveat to heterogeneous systems, though: OpenCL kernels that are written and optimized for one hardware platform probably won’t perform the same as on another hardware platform.  While OpenCL enables developers to write code that can run on multiple hardware devices, the hardware implementations may vary.  For example, the number of processor cores, and thus the number of parallel threads may vary widely.

If you can’t tell already, we are sold on the promise of OpenCL for GPGPU.  The language is easy to use (if you already know C), and it supports the two biggest players in the GPU market, NVidia and AMD/ATI.  We are hoping that Intel releases their OpenCL drivers for CPUs, too, so that we can squeeze out the last drip of computing power for our computations.

What the heck is … GPGPU?

General Purpose Computing on the Graphics Processing Unit, or GPGPU, is a technology that enables software to use the computing power of the multiple processing units that come with modern graphics cards.  The benefits of using these processing units is that computational work gets done by many workers at once, instead of one CPU or a few threads on a multi-core CPU.

At first, it may sound like multi-threading, and in a way it is.  What’s key, though, is that the GPU doesn’t just multi-thread, it’s synchonized, lock stepped multi-threading.   It’s called Single Instruction, Multiple Data, or SIMD in Flynn’s Taxonomy (NVidia’s hardware uses a variant called Single Instruction, Multiple Threads, or SIMT).  This refers to the control flow in each thread on the GPU — to maximize performance, it’s important to get all threads running through the same logic at the same time.  In practice, some threads will eventually branch, and keeping the branching to a minimum is critical.

When we have processing tasks that can be highly discrete, the SIMT nature of the GPU enables computations to really fly when using GPGPU.  Structuring computations in this fashion will result in operations that are highly parallelized.  Coincidentally, they will also be easily distributed to other computation engines.  But unlike other distributed computing methods, GPGPU is tightly coupled, and has very low latency — you get immediate results.

OpenCL Logo

To facilitate all of above, GPGPU requires a video card that supports programmability.  While the major GPU manufacturers have released proprietary GPGPU languages — for example is NVidia’s CUDA framework — a cross-platform, device-independent alternative, called OpenCL, has recently become sufficiently robust that we can begin to use it for implementing GPGPU processing.  While originally developed by Apple, OpenCL is now overseen by a non-profit organization, the Khronos Group.

If you’ve got a GPU on hand, it’s quite likely that there is an OpenCL API available in your programming language of choice.  Here’s a short list from the Khronos Group’s website:

By the way, Khronos just released the spec for OpenCL 1.1 on June 14th, 2010.


Our interest in GPGPU is motivated by our desire to speed up complicated raster math calculations, in particular those that would enable us to make more sophisticated geoprocessing available on web and mobile platforms.  Our vision is to make an order of magnitude or better improvements in geospatial data processing and thereby make web-based analysis tools more responsive and compelling.  To support our research we applied for and were awarded a National Science Foundation (NSF) Small Business Innovation Research (SBIR) grant to benchmark some traditional GIS operations against  GPU-accelerated versions of the same operations.  Our objective is to improve these operations by about 20x, and it looks promising.

We’re sampling the wide array of Map Algebra operations that are in standard GIS toolkits, focusing on a couple examples of each of the following types of operations: local, focal, zonal, and global.  Some of these types of operations are easily parallelized, such as local and focal operations.  The other operations require a fair amount of algorithm wrangling before they show any improvement on the GPU’s parallel architecture.

Our research is ongoing, so it’s still a bit too early to tell, but we’re excited to already see some impressive performance improvements.

GPGPU Lessons

Migrating our chosen Map Algebra operations to the GPU was trivial in some cases, but quite challenging in others.  For example, local Map Algebra operations are quite easy to parallelize.  Each raster cell is compared to another raster cell in the same location.  This requires very little extra memory and few intermediate steps.

Focal (or “neighborhood”) operations are not simple, particularly for large neighborhoods, but with some intelligent partitioning, it’s relatively easy to work with different sections of the rasters.  This requires some intermediate buffers, but performs well using local memory on the GPU.

Zonal operations summarize the cells in one raster data set using zones identified in a second one.  These operations require a couple scans across a raster, and a few intermediate stages — that one was complicated.

What’s crazy difficult, though, are the global operations.  These operations are challenging because they operate on the entire dataset, with one cell on one side of the raster potentially affecting a cell on the other side of the raster. Examples of global operations include Euclidean distance, cost-weighted distance, viewshed analysis and others.  These operations are non-trivial to convert to GPU processing in a performant way, but it is definitely possible.

GPU Computing for GIS

We live in exciting times.

Computing power continues to grow at an exponential rate, and is well characterized by Moore’s Law (if you are looking for a graph more recent than 1965, try Wikipedia).  This means that computing power is moving in many directions.  The rise of laptops, notebooks, tablets, and smartphones are a testament to the increasing computing power of microprocessors.  They are getting faster, smaller, lighter, more power efficient, and sprouting more cores.

Despite this accelerating computing power, however, on some of our projects, we’ve seen how many heavy-duty analytical computing tasks remain too costly (in terms of computing time) to be run on the web with more than a small number of users.  However, by distributing the computation across multiple processors and machines, we have found it is possible to improve both the scalability and speed of some geographic data processing tasks.  For one such task, a weighted raster overlay operationg, we have been able to accelerate the process enough to make a scalable web application possible.  Azavea’s DecisionTree framework, developed with support from an SBIR grant from the US Department of Agriculture.

With this experience developing distributing geoprocessing algorithms, we have recently been taking a look at technologies that will enable us to make similar types of performance and scalability improvements.  One technology that we believe has great promise for bringing these processes to the web is General Purpose Computing on the Graphics Processing Unit (GPGPU).

GPGPU leverages the microprocessors that power many modern graphics cards.  NVidia and ATI are the largest players in the high performance video adapter field, and they both have GPU computing libraries that run on their video adapter hardware.

GPU’s are accelerating everything.

GPU’s are powerful for general purpose computing not just because of their clock speed, but because there are just so many multiprocessors on today’s GPU graphics cards.  While a quad-core CPU is a high-end processor for most servers, today’s high-end graphics cards have 100, 200 and 500 or more cores and are capable of gigaFLOPS double precision processing power (NVidia, ATI, respectively).  And these numbers are doing nothing but going up.

A few ways of comparing just what that means:

  • a handheld calculator runs at about 10 FLOPS (not giga-, just plain FLOPS, one billionth of a gigaFLOP).
  • by the time you blink your eye, 154 gigaFLOPS have occurred on the NVidia Tesla C2070.
  • by the time a hummingbird flaps it’s wings, 10.3 gigaFLOPS have occurred on the same card.
  • by the time one FLOP has occurred on the same card, your voice has only traveled through 0.64 μm of air (human hair ranges from 17-181 μm thick)

In addition to processors and processing speed, GPU cards have fast, specialized memory access.  They have a limited amount of local memory, but if you can figue out a way to use it efficiently, your memory access is on the order of 100x faster than conventional memory.

The combination of more processors and faster memory means that if you can discretize or parallelize the type of work that you want to perform, you can get radical speed improvements.

GIS on the GPU.

That’s all well and good, but how can GPGPU be used for GIS?  We are not the only ones thinking about this, but the answer depends on what kind of analysis you want to do.  We have been focusing our research on a few types of MapAlgebra operations, and our preliminary investigations have shown that all types of MapAlgebra operations can benefit from processing on the GPU.  In addition, we believe substantial improvements can be made in some types of vector processing with a few likely candidates would be:

  • Vector-to-raster and raster-to-vector conversion
  • Network analysis
  • Network routing
  • Transformations of geometric collections

All of these optimizations have the potential of reducing the computing time for heavy duty GIS operations from hours to minutes, and therefore minutes to seconds.  With that kind of speedup, the “attention threshold” of the web can be achieved.  It now becomes possible to run more complex GIS tasks in a web environment, bringing more computing power to the masses.

These changes won’t change the world right away, but it will make GIS analysis more interactive, responsive, and efficient.  Just imagine if you could complete any given task in your day in 1/10th the time (think Dash, from the Incredibles).