GPU Occupancy and Idling

As our research into raster processing for GIS on the GPU progresses, each Map Algebra operation has gone through several stages of development. Once an operation has been converted to run on the GPU, we find there are many potential ways to optimize it, and the optimization process brings with it a host of issues that highlight the differences between sequential CPU programming and GPGPU parallel programming.

During the optimization process, we’ve found (and been told) that the single most important optimization is to ensure memory coalescence. I covered that in an earlier post in this series, GPU Memory Bandwidth and Coalescing, so if you haven’t seen it yet, it might be worth reading before you continue.

After maximum memory coalescence has been achieved, it is possible to focus on two additional metrics: occupancy and idling.

Occupancy

The occupancy metric is defined as the number of active thread groups per processor divided by the maximum number of thread groups per processor.  It’s a value in the range of 0-100%.
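As a rough illustration of the definition above, here is a minimal Python sketch that computes the ratio and estimates how many thread groups can be active on one processor. The `ProcessorLimits` values and the function names are assumptions made up for this example, not figures from our hardware, and the estimate ignores details such as register allocation granularity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessorLimits:
    """Per-processor limits. All values are illustrative assumptions."""
    warp_size: int = 32            # threads per thread group
    max_warps: int = 32            # thread-group slots per processor
    max_blocks: int = 8            # resident blocks per processor
    registers: int = 16384         # registers per processor
    shared_mem_bytes: int = 16384  # shared memory per processor


def occupancy(active_warps: int, max_warps: int) -> float:
    """Active thread groups divided by the maximum, as a fraction in [0, 1]."""
    if max_warps <= 0 or not 0 <= active_warps <= max_warps:
        raise ValueError("need max_warps > 0 and 0 <= active_warps <= max_warps")
    return active_warps / max_warps


def active_warps(threads_per_block: int, regs_per_thread: int,
                 smem_per_block: int,
                 limits: ProcessorLimits = ProcessorLimits()) -> int:
    """Estimate resident thread groups: the scarcest resource decides."""
    # Round up: a partially filled thread group still occupies a slot.
    warps_per_block = (threads_per_block + limits.warp_size - 1) // limits.warp_size
    blocks = [limits.max_blocks, limits.max_warps // warps_per_block]
    if regs_per_thread > 0:
        blocks.append(limits.registers // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        blocks.append(limits.shared_mem_bytes // smem_per_block)
    return min(blocks) * warps_per_block


# Hypothetical kernel: 256 threads per block, 20 registers per thread,
# 4 KB of shared memory per block.
warps = active_warps(256, 20, 4096)
print(f"{warps} active thread groups, occupancy {occupancy(warps, 32):.0%}")
```

Under these assumed limits the register budget is the bottleneck: only three blocks fit, giving 24 of 32 thread groups, or 75% occupancy. Dropping the hypothetical kernel to 16 registers per thread would let a fourth block fit and bring occupancy to 100%.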

Put another way, occupancy measures how many thread groups (NVidia calls them ‘warps’, ATI calls them ‘wavefronts’) are active on a processor at one time, relative to the most it can hold. At any one time, some active thread groups may be processing data while others are accessing global memory. A thread group that is accessing global memory is effectively stalled for hundreds of clock cycles, while the other thread groups continue on.

Internally, the GPU has a thread group scheduler that controls when thread groups are executed. Highly parallel operations use many thread groups, often more than even the GPU can run at once, so the scheduler executes some thread groups while others sit idle, either completed or queued. This scheduling lets some thread groups perform memory access while other thread groups perform calculations.

Understanding the scheduler makes it possible to ‘hide’ these global memory accesses by performing roughly 100 arithmetic instructions between each global memory access. Hypothetically, suppose the GPU ran a kernel at high occupancy that read from global memory, performed a heavy-duty calculation, then saved the result. The thread group scheduler could run one set of thread groups through the calculation while another set waited on global memory. This effectively ‘hides’ the memory access, since the GPU keeps executing computation instructions while memory requests are in flight.

There is a point beyond which increasing occupancy no longer improves performance. Once all global memory accesses are ‘hidden’ by computation, it is time to look elsewhere for optimizations.
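To make the plateau concrete, here is a toy model in Python. It assumes every thread group alternates between a fixed burst of arithmetic and one global memory access, and that the scheduler can switch between thread groups at no cost. The function names and the cycle counts (100 cycles of arithmetic, 400 cycles of memory latency) are hypothetical values chosen for illustration, not measurements.

```python
import math


def warps_to_hide_latency(compute_cycles: int, memory_latency_cycles: int) -> int:
    """Thread groups needed so that one is always ready to compute."""
    # While one thread group waits on memory, the others must supply
    # enough arithmetic to fill the whole compute + wait period.
    return math.ceil((compute_cycles + memory_latency_cycles) / compute_cycles)


def alu_utilization(n_warps: int, compute_cycles: int,
                    memory_latency_cycles: int) -> float:
    """Fraction of cycles spent on arithmetic in this toy model (capped at 1)."""
    period = compute_cycles + memory_latency_cycles
    return min(1.0, n_warps * compute_cycles / period)


COMPUTE = 100   # assumed cycles of arithmetic between memory accesses
LATENCY = 400   # assumed global memory latency in cycles

print("thread groups needed:", warps_to_hide_latency(COMPUTE, LATENCY))
for n in (1, 2, 4, 5, 8):
    print(n, "thread groups ->", alu_utilization(n, COMPUTE, LATENCY))
```

With these numbers, utilization climbs from 0.2 with one thread group to 1.0 with five, and stays at 1.0 beyond that. Real hardware is messier, but the shape is the same: past the point where the latency is covered, extra occupancy buys nothing.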

Idling

The idling metric is defined as the amount of time the GPU is idle divided by the overall execution time of the computation.  It’s a value in the range of 0-100%.
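A minimal sketch of how this metric could be computed from a timeline, assuming you have recorded the intervals during which the GPU was executing kernels. The `idle_fraction` function and the sample timings are made up for this example. Overlapping intervals are merged so that concurrent kernels are not counted twice.

```python
def idle_fraction(busy_intervals, total_time):
    """GPU idle time divided by overall execution time, as a fraction in [0, 1].

    busy_intervals: (start, end) pairs during which the GPU was executing,
                    assumed to lie within [0, total_time].
    total_time:     overall execution time, in the same units.
    """
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    busy = 0.0
    covered_until = 0.0  # everything before this point is already counted
    for start, end in sorted(busy_intervals):
        start = max(start, covered_until)  # skip any overlap with earlier intervals
        if end > start:
            busy += end - start
            covered_until = end
    return (total_time - busy) / total_time


# Hypothetical 10 ms run in which the GPU executed kernels during
# 2-5 ms and 7-9 ms and waited the rest of the time.
print(f"idle: {idle_fraction([(2.0, 5.0), (7.0, 9.0)], 10.0):.0%}")
```

In this made-up run the GPU is busy for 5 ms out of 10 ms, so the idling metric is 50%.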

We have found idling to be critical to the performance of a calculation. The reference and training documentation instructs GPGPU developers to keep the GPU as busy as possible for as long as possible, and stops there. Defining this metric let us measure just how much idling was affecting our computation.

As it turns out, our initial experiments showed that our GPU was idle during periods of memory transfer to and from the CPU. This idling extended the overall time for computation. Minimizing it through asynchronous kernel execution and memory transfer, so that transfers overlap with computation, resulted in a significant and immediate performance improvement.
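The effect can be sketched with a small timing model in Python. It is a simulation, not GPU code, and it rests on several assumptions: the work splits into independent chunks, every chunk has the same upload, kernel, and download times, and the hardware can copy in both directions while a kernel runs. The function names and the timings are hypothetical. In CUDA, this kind of overlap is typically expressed with streams and `cudaMemcpyAsync`.

```python
def serial_schedule(chunks, upload, kernel, download):
    """Each chunk is uploaded, processed, then downloaded, one step at a time."""
    total = chunks * (upload + kernel + download)
    busy = chunks * kernel
    return total, (total - busy) / total


def overlapped_schedule(chunks, upload, kernel, download):
    """Uploads, kernels, and downloads run as a three-stage pipeline."""
    upload_done = kernel_done = download_done = 0.0
    for _ in range(chunks):
        # The upload engine starts the next chunk as soon as it is free.
        upload_done += upload
        # A kernel needs its own upload finished and the previous kernel done.
        kernel_done = max(upload_done, kernel_done) + kernel
        # A download needs its kernel finished and the previous download done.
        download_done = max(kernel_done, download_done) + download
    total = download_done
    busy = chunks * kernel
    return total, (total - busy) / total


# Hypothetical workload: 4 chunks, 2 ms upload, 3 ms kernel, 1 ms download.
for name, schedule in (("serial", serial_schedule),
                       ("overlapped", overlapped_schedule)):
    total, idle = schedule(4, 2.0, 3.0, 1.0)
    print(f"{name}: {total:.0f} ms total, GPU idle {idle:.0%}")
```

With these made-up timings the serial schedule takes 24 ms with the GPU idle half the time, while the overlapped schedule takes 15 ms with the GPU idle 20% of the time. The remaining idle time is the first upload and the last download, which have nothing to overlap with.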

Coalescence, Occupancy, Idling

To summarize, the best way to optimize your GPU computations is to work through these three areas, in this order:

  1. Memory coalescence
  2. Thread group occupancy
  3. GPU Idling

There are a number of smaller optimizations that can be done as well, but we’ve found these to be the big three. Of course, you can continue this process forever, and demonstrate to your boss the law of diminishing returns.



Other posts in this series

  1. GPU Computing for GIS
  2. What the heck is ... GPGPU?
  3. CUDA, Stream, and OpenCL
  4. GPUs and Parallel Computing Architectures
  5. GPU Memory Bandwidth and Coalescing
  6. GPU Occupancy and Idling (This post)

2 Comments

  1. Tom Gross
    Posted 28 July 2010 at 1:39 pm

    Great series of blog posts on GPGPU. I see that you haven’t solved the problem of why there isn’t a “C” in “GPGPU” either. Just wanted to let you know that there are some of us here at the big old legacy code dinosaur who are thinking along the same lines:

    http://blogs.esri.com/Dev/blogs/apl/archive/2010/03/30/Computations-on-vector-data-using-a-GPU.aspx

    There is much work to be done. Aside from rewriting core functionality, we still have to be able to visualize results as quickly as they can be calculated. We still have to put our calculations and visualization into the server environment. We have to start operating in the 64-bit world. Much of GIS work is done on double precision data. The hardware vendors are just now beginning to produce GPUs that can handle double precision data well. It was a pleasure to read your blog.

  2. David Zwarg
    Posted 5 August 2010 at 8:35 am

    Hello Tom,

    Thanks for your comment! I have no doubt that there is much work to be done. GPUs are coming out that support double precision natively; take a look at the new Tesla and Fermi cards from NVidia.

    GPUs are starting to make their way into the server environment: NVidia has at least 2 server products for intense processing: http://www.nvidia.com/page/servers.html.

    In addition, we’ve proven the technology works for a limited set of the operations supported by Map Algebra. Rewriting the Map Algebra library completely for GPGPU is going to be a lot of work. This is mostly because of the full suite of operations that are supported by “the big old legacy code dinosaur” — no small feat in itself, and kudos to the team that built those tools.