When you start working with GPGPU, the parallel processing gains can be enormous, provided you know how to work with coalesced memory. This fits into a broader approach to parallel algorithms that involves the following:
- thinking about your computation in a data-parallel fashion.
- transferring working data into a local memory cache.
- scrutinizing how your code accesses global memory.
The first item almost goes without saying. If you are hoping to leverage a massively parallel computing device, you obviously have to break your problem or computation down into discrete units that can be operated on in parallel.
It’s the second and third points that I am going to focus on in this post, since they are the most important factors when optimizing your GPGPU code. They matter most because local memory is so much faster to read and write than global memory, and because the memory module in modern GPUs can perform concurrent reads of sequential global memory positions for an entire thread group.
Local Memory Caching
Use of a local memory cache may seem counter-intuitive to a programmer coming from CPU land. The best analogy is storing your working data in RAM instead of on disk. It is not a perfect analogy, but a CPU programmer understands the ramifications of that design decision perfectly: any data accessed from disk is retrieved more slowly than data accessed from RAM. The same goes for local and global memory. Local memory (the OpenCL term; CUDA calls it shared memory) is on-chip memory that is exceptionally fast. Global memory is off-chip memory that is often used to transfer data to/from the host (usually the CPU). I’m talking about a speed difference of roughly 100x when using local memory instead of global memory.
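To get a feel for what staging data in local memory buys you, here is a back-of-the-envelope sketch in plain Python (not GPU code). It only counts global memory reads, and it rests on assumptions of mine that are not from any particular kernel: a raster operation where each cell reads a 3x3 neighborhood, 16x16 work-groups, and no special handling of raster edges. The function names are made up for illustration.

```python
def global_reads_direct(width, height, radius=1):
    """Global reads when every work-item fetches its own neighborhood."""
    window = (2 * radius + 1) ** 2
    return width * height * window


def global_reads_tiled(width, height, tile=16, radius=1):
    """Global reads when each work-group stages its tile (plus a halo) once."""
    groups_x = -(-width // tile)   # ceiling division
    groups_y = -(-height // tile)
    staged_cells = (tile + 2 * radius) ** 2
    return groups_x * groups_y * staged_cells


if __name__ == "__main__":
    direct = global_reads_direct(1024, 1024)
    tiled = global_reads_tiled(1024, 1024)
    print(direct, tiled, round(direct / tiled, 1))
```

Under these assumptions a 1024 x 1024 raster needs roughly 7 times fewer global reads once each work-group copies its tile into local memory first. The neighborhood reads still happen, but they hit the fast on-chip memory instead.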
In addition to the difference between global and local memory, the bandwidth between the graphics card (which contains its own memory and processors) and the motherboard (which contains RAM and one or more CPUs) is another bottleneck. Data transfer rates across the PCI Express 2.0 bus are about 8 GB/s. Data transfer rates within the graphics card are around 141 GB/s. So it matters not only where you store your working data, but also when and how you transfer that data to/from the GPU device itself.
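A quick bit of arithmetic shows how much the bus matters. This sketch uses only the two bandwidth figures quoted above and treats them as sustained rates, which is optimistic. The raster size is a hypothetical example of mine, and `transfer_ms` is just a helper name.

```python
PCIE_GB_PER_S = 8.0       # PCI Express 2.0 figure from the text
ON_CARD_GB_PER_S = 141.0  # on-card figure from the text


def transfer_ms(num_bytes, gb_per_s):
    """Milliseconds to move num_bytes at a sustained gb_per_s (1 GB = 1e9 bytes)."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    raster_bytes = 8192 * 8192 * 4  # hypothetical raster: 8192 x 8192 cells, 4 bytes each
    print(round(transfer_ms(raster_bytes, PCIE_GB_PER_S), 1))     # across the bus
    print(round(transfer_ms(raster_bytes, ON_CARD_GB_PER_S), 1))  # within the card
```

For this raster, the trip across the bus takes about 34 ms, against about 2 ms to stream the same bytes on the card. That is why it pays to copy data to the device once and keep it there for as many kernel launches as possible.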
Sequential Global Memory a.k.a. Coalescence
And “sequential global memory positions”? What is that? Inside a GPGPU kernel, when a group of threads (NVIDIA calls them ‘warps’, and ATI calls them ‘wavefronts’) accesses global memory, the requests of all threads in that group are serviced together. For example, if 16 threads are executing the same kernel, 16 sequential positions in global memory (1 position per thread) can be accessed in the same time that it would take 1 thread to read 1 position in memory. If all memory accesses are performed this way, the memory access code can run up to 16 times faster.
That’s a wonderful way to speed up data-intensive operations, especially when one is working with raster data and a given block of cells is accessed multiple times. This is exactly the scenario that our research has recently landed us in.
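The 16-thread example can be sketched as a toy model in plain Python (again, not GPU code). The assumptions are mine and simplify what real hardware does: each thread reads one 4-byte value, and global memory is served in aligned 64-byte segments, so one segment holds exactly 16 values. The model simply counts how many segments a thread group touches.

```python
def thread_addresses(num_threads, stride, element_bytes=4):
    """Byte address read by each thread when thread t reads element t * stride."""
    return [t * stride * element_bytes for t in range(num_threads)]


def transactions(addresses, segment_bytes=64):
    """Number of aligned segments needed to serve all the addresses."""
    return len({addr // segment_bytes for addr in addresses})


if __name__ == "__main__":
    print(transactions(thread_addresses(16, stride=1)))   # sequential positions
    print(transactions(thread_addresses(16, stride=16)))  # scattered positions
```

With a stride of 1 the whole group is served by a single segment. With a stride of 16 (for instance, walking down a column of a 16-cell-wide row-major raster) every thread lands in its own segment, so the group needs 16 trips to memory. That is where the factor of 16 comes from.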
Another thing worth noting is that the coalescence concept applies to global memory on the GPU only. Local memory does not suffer the same performance hit, so it does not need this technique. But a global memory access on the GPU takes about 100x as many cycles as a local memory access. This means that with coalesced global memory access, you are saving hundreds of cycles per thread. This starts to add up when you consider that processing a raster may require hundreds or thousands of threads.
Armed with this knowledge, parallel algorithm implementations begin to have similar structures with regard to memory access. The resulting code can be highly complex and is not trivial to debug, but some new tools from NVIDIA and ATI are enabling developers to profile and visualize the work performed by the GPU. In my next post, I’ll discuss latency and occupancy, two metrics that one can use to help optimize GPU kernels.

2 Comments
Please don’t say local memory when you mean shared memory. NVIDIA’s nomenclature is fuzzy enough with mixing up the terms. “Shared memory” is the small but fast on-chip memory. “Local memory” is just normal global memory that is only visible to a single thread and is typically used when there aren’t enough registers to accommodate your memory needs (referred to as “register spilling”). See the programming guide section 5.3.2.
Also, saying that coalescing does not apply to shared memory is not entirely accurate, although the term is different in that case. If two or more threads request different addresses in the same shared memory bank, there is a bank conflict and the requests have to be serialized (5.3.2.3). The exception is when all requests are for the same address, in which case the bank conflict can be avoided with a broadcast. That being said, coalescing global memory is a higher priority, as having to make many trips to memory at a cost of 400-600 cycles a hit is much worse than the 4-6 cycles per shared memory access, but the shared memory cost isn’t negligible.
Hi Matthew,
Thanks for your feedback. You are correct about the nomenclature — NVidia and ATI have different names for everything. I used the term local memory since most of the work that we are doing is in OpenCL, which has three different types of memory: global, local, and private. The corresponding NVidia terms are global, shared, and “private local” — it’s easy to mix them up, especially when you aren’t clear which API you are using (in my case).
Regarding your note about shared memory coalescing: you are correct that there is a small performance hit in shared memory accesses if they are not coalesced. In relative terms, when converting algorithms from the CPU to the GPGPU, ensuring that global memory is coalesced takes a much higher priority due to the number of instructions involved relative to other memory optimizations, like coalescing shared memory. I agree with you that it isn’t negligible; if we were trying to squeeze every last drop of speed out of the GPU, we’d spend more time optimizing for that as well.