Azavea Labs

Where software engineering meets GIS.

GPUs and Parallel Computing Architectures

I’ve been blogging about GPUs recently, and I think you can tell it’s because I’m excited about the technology.  General Purpose Computing on the GPU (GPGPU) promises great performance increases in computationally heavy software, which we find immensely useful.  In the past, we’ve managed to engineer web-based applications (see: SmartConservation) that could run complex models by implementing a process queuing architecture, but in these systems, while they will run on the web, processing may still take several minutes and they therefore can neither provide a responsive user experience nor support many users.  We’ve also engineered a system that can perform fast, distributed raster calculations (see: Walkshed, powered by DecisionTree).

One of the reasons that GPGPU is so promising is the increasing number of processing cores available on affordable graphics cards.  This increases the computation capacity by leveraging many processors running in parallel. What’s interesting is that this technique is not new.  Timothy Mattson, blogging at Intel, has been doing this since the mid 80’s.  The Library of Congress contains a book on parallel computing structures and algorithms, dating back to 1969.

As we delve deeper into our work improving Map Algebra operations, important differences in algorithm approaches and implementations become apparent: not all parallel architectures are the same.  One might be tempted to think that when switching from the single-threaded CPU logic to multithreaded/parallel logic that there would be one model of parallel computing that is universal.  This is definitely not the case.

Three of the most popular types of parallel computing today are:

  • Shared-memory Multi-Processors (SMP)
  • Distributed-memory Massive Parallel Processors (MPP)
  • Cluster computing

Each type of parallel computing has its benefits and drawbacks.  It really just depends what kind of computing you need to do. I’ll describe these common computing types in detail, starting with the ‘traditional’ CPU model.

General Purpose

The ‘traditional’ processors used in many computers up until a few years ago were single core processors, called the Central Processing Unit (CPU).  The CPU was able to access a large, general-purpose memory bank, called Random Access Memory (RAM).

Simplified CPU Architecture

Simplified CPU Architecture

In contemporary computers, the CPU often contains more than one core, making CPU’s capable of more than one instruction at a time.  In addition to superscalar instruction processing, this makes modern CPUs much faster than their single core, scalar predecessors.

Shared-memory Multiprocessors

An SMP architecture is probably the one parallel computing architecture that is most like the general purpose architecture with which we are familiar. SMP are a set of processors that all have their own local memory.  These memory banks are shared within a thread group, but not between more than one thread group.  However, each processor also has access to a global memory bank, which is shared between all processors.

Simplified Shared-memory Multiprocessor Architecture

Simplified Shared-memory Multiprocessor Architecture

This is the parallel architecture that NVidia and AMD/ATI both use in their GPUs.  Likewise, it’s also the model enforced in the OpenCL specification.

Distributed-memory Massive Parallel Processors

The most complicated and flexible architecture type is MPP.  MPP systems isolate memory and processors together, and as such, have no common or shared memory.  Each processor has a dedicated block of local memory, and communicates with other processors via a bus or network.  By varying the number of processors each processor is connected to, different types of MPP systems can be created.  For example:

  • Linear array: if the processors were arranged in a line, each processor is connected to 2 neighboring processors

    Simplified Linear Array Architecture

    Simplified Linear Array Architecture

  • Linear ring: if the processors were arranged in a circle, each processor is connected to 2 neighboring processors (a linear array, with the ends connected)
  • Mesh: if the processors were arranged in a grid, each processor is connected to up to 4 neighbors (3 on the edges, and 2 in the corners)

    Simplified Mesh Architecture

    Simplified Mesh Architecture

  • Tree: if the processors were arranged in a hierarchical manner, with each processor connected to the processor above it, and two processors below it.
  • Pyramid: if the processors were arranged similar to a tree, but in three dimensions, with each processor connecting to four processors below it.

    Simplified Pyramid Architecture

    Simplified Pyramid Architecture

  • Cube: if the processors were arranged similar to a mesh, but in three dimensions, with each processor connected to up to 6 neighbors.
  • Hypercube: if the processors were arranged similar to a cube, but in four dimensions, with each processor connected to up to 8 neighbors.

As you can see, the processors in MPP systems can proliferate quite rapidly with more complex processor network topologies.   We haven’t worked with any MPP systems for our GPU research, so I’ll let you ponder that while I return to the GPU architecture.

GPU Memory – Not Your Father’s RAM

As I mentioned above, GPUs and OpenCL implementations are based on the SMP architecture.  As such, GPUs have multiple types of memory, with different implications for each type.

  • Global memory: this is often the big number on the graphics card packaging.  512 MB DDR, etc.  This is the amount of global memory that is available to the GPU processors.  This memory is essentially used as a fast cache to the motherboard RAM, since it’s used to transfer raw data to the GPU for processing, and storing computation results prior to reading out back to motherboard RAM.
  • Local shared memory: this is a much smaller bank of memory that is extremely fast.  On the hardware that we’re using, it’s limited to 16KB. With some smart memory management, this local memory can really speed up computations, since the instruction cost of accessing this memory is 1% of that required to access global memory.  Also, this memory is shared between all threads in a work-group.
  • Private thread memory: this is an extremely small bank of memory that can be used within each thread for variables and temporary storage during your computation. Interestingly, in the NVidia implementation, this uses registers for a certain amount, then starts using global memory when registers are exhausted.

The differences in memory types are probably the first thing a general purpose GPU programmer will run into. Another thing to keep in mind is the method by which the GPU achieves such high throughput, and that’s thread parallelism.

Single Instruction Multiple Threads

In OpenCL, each parallel code path executes one kernel.  The best possible outcome (in regards to thread synchronicity) is when each kernel executes the exact same instructions as all other threads.  With each thread managing a different nugget of data, this results in extremely fast execution.  However, if the kernel code diverges or branches, there is a performance penalty: that section of your code will execute serially (think 16x to 32x slower).

NVidia implements this architecture, and has called it Single Instruction Multiple Threads (SIMT). It’s kind of like line-dancing for threads.  All threads that execute the same instructions can perform together.  If a thread diverges or branches, then the line-dance is broken, and each thread processes a divergent section one after another.  What’s kind of cool, though, is that the threads will join back up after diverging, and continue on together.

Wrapping it up

With a solid understanding of how the GPU operates in addition to the limitations of memory and threading, it’s relatively easy to start computing on the GPU.  Many common operations are easily parallelizable, such as sorting and basic mathematical operations.  When you start performing serious number crunching, or if you are porting a beefy algorithm from serial CPU code, that’s when the real fun begins.