Azavea Labs

Where software engineering meets GIS.

Getting Started with Google Glass

Last week, I started my journey with Google Glass. As part of the Glass Explorers program, designers, developers, and other people with something exciting to share (including soccer moms, journalists, and Neil Patrick Harris) will have a sneak peek at what it’s like to live with the next generation of mobile technology. Over the next few months, I’ll be exploring ways to meld Glass with our work in GIS, crime analysis, and digital humanities through Azavea’s R&D program. But many people still don’t know what Glass does, why it’s exciting, and where this new technology will lead.


What is Google Glass

At the orientation program in New York City’s Chelsea Market, Glass Explorers are given a full walk-through of how to use Glass by a trained Glass expert. The device, which is worn as a frameless pair of glasses, displays a 640×360 pixel transparent screen, projecting information slightly above eye-level. This allows the screen to fade away when not in use, avoiding distractions and saving precious battery life.

Glass offers a handful of core features, including voice commands for making and receiving phone calls, Google searches, taking photos and video (and sharing via social media), and turn-by-turn directions. Glass integrates with your personal Google account, pulling in your Gmail and Google Now data, like email, weather, point of interest information, and flight itineraries.

Why Glass is Exciting

The idea of wearable technology has been around in science fiction for a long time, and Glass isn’t the first attempt at this type of device. To those who see Glass as a glorified Bluetooth headset, I urge you to remember how new the device is (the iPhone 3G was released with an App Store of only 500 apps, many of them utility apps). The thing that makes Glass unique is the way that you use it to acquire information.

Need to look up a recipe? “Okay, Glass. Google recipes using chicken.” You’ll get your results side-by-side in your right eye, easily accessible when you don’t have a free hand and out of sight when you don’t need them.

Glass isn’t about playing Angry Birds at the dinner table without anyone knowing. It’s about giving you updates and information quickly and easily, hands-free. Glass can also send and receive data via the Mirror API – making it easy for developers to build apps that give users useful data. This is where my research project comes in: finding ways to port the work we’ve done with HunchLab and PhillyHistory to the world of Glass.

What the Future Holds

Google has been releasing updates on a monthly basis for Glass. XE7 featured mobile web browsing and the ability to import all of your Gmail contacts into Glass for calls and sharing. In the coming months, Google will likely continue to work on improvements to the system, improve features and capabilities of the device itself (including battery life), and add more polish to the interface.

But the people who will make or break Glass (both literally and figuratively) are the Glass Explorers. The success of the device is going to rely on the development of apps, called Glassware. The release of an app store filled with Glassware will make the device incredibly useful to both professionals and casual users. Glass developers will need to remember the point of this device: not escaping reality, but improving it by making it easier to connect quickly with the information we need.

Will Glass replace the smartphone or tablet? Doubtful – at least not for quite some time. But it’s one of the first big contenders in the wearable technology arena and I see a lot of potential moving forward.

For more information and updates, follow me on Twitter @mike_tedeschi and read more on my blog, http://www.mtedeschi.com.

GPU Occupancy and Idling

As our ongoing research into raster processing for GIS on the GPU progresses, we have gone through various stages in the development of each Map Algebra operation.  Having converted a given operation to the GPU, we are finding that there are many potential ways to optimize, and this optimization process brings with it a host of issues that highlight the differences between sequential CPU programming and GPGPU parallel programming.

During the optimization process, we’ve found (and been told) that the single most important optimization is to ensure memory coalescence.  I blogged about that before, so if you haven’t seen it yet, it might be worth reading before you continue on.

After maximum memory coalescence has been achieved, it is possible to focus on 2 additional metrics: occupancy and idling.

Occupancy

The occupancy metric is defined as the number of active thread groups per processor divided by the maximum number of thread groups per processor.  It’s a value in the range of 0-100%.

Occupancy is the number of thread groups (NVidia calls them ‘warps’, ATI calls them ‘wavefronts’) that are active at one time.  At any one time, some thread groups may be processing data, and some thread groups may be accessing global memory.  A thread group that is waiting on global memory is effectively stalled for hundreds of cycles, while the other thread groups continue on.

Internally, the GPU has a thread group scheduler which controls when thread groups are executed. This is extremely useful, since highly parallel operations will utilize many thread groups to perform calculations. The GPU is highly parallel, but even it has its limits. This is where the thread group scheduler comes in — it can execute some of the thread groups, while other thread groups are idle, either completed or queued. This scheduling enables some thread groups to perform memory access, while other thread groups perform calculations.

Understanding the scheduler makes it possible to ‘hide’ these global memory accesses by performing ~100 arithmetic instructions between each global memory access.  Hypothetically, if the GPU executed a kernel that accessed global memory, performed a heavy-duty calculation, then saved that result, the occupancy would probably be pretty high. The thread group scheduler would schedule a set of thread groups for accessing global memory while scheduling another set of thread groups for heavy-duty calculation. This is effectively ‘hiding’ the memory access, since the GPU can perform computation instructions while accessing memory. Interestingly, there will be a point when increases to occupancy won’t improve your performance. That is the point at which all global memory accesses are ‘hidden’ by the computation, and it becomes time to look elsewhere for optimization.
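To make the occupancy number concrete, here is a minimal CUDA sketch (a hypothetical kernel, not one of our Map Algebra operations) that asks the runtime how many blocks of a given kernel can be resident on one multiprocessor and turns the answer into an occupancy percentage.  The cudaOccupancyMaxActiveBlocksPerMultiprocessor call assumes a relatively recent CUDA toolkit; the same number can be worked out by hand from the kernel’s register and shared memory usage.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel: one global load, a little arithmetic, one global store.
    __global__ void scaleAndOffset(const float* in, float* out, float scale, float offset, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = in[i] * scale + offset;
        }
    }

    int main()
    {
        const int blockSize = 256;

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // How many blocks of this kernel can be resident on one multiprocessor at once?
        int activeBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocks, scaleAndOffset, blockSize, 0);

        // Occupancy = active warps per multiprocessor / maximum warps per multiprocessor.
        int activeWarps = activeBlocks * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;

        printf("Theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
        return 0;
    }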

Idling

The idling metric is defined as the amount of time the GPU is idle divided by the overall execution time of the computation.  It’s a value in the range of 0-100%.

Idling is something that we have discovered to be critical to the performance of a calculation.  The reference and training documentation instructs GPGPU developers to keep the GPU as busy as possible for as long as possible, and stops there.  By creating this metric, we were able to measure just how much this idling was affecting our computation.

As it turns out, our initial experiments showed that our GPU was idle during periods of memory transfer to and from the CPU.  This idling of the GPU was extending the overall time for computation.  Minimizing this idling through asynchronous kernel execution and memory transfer resulted in a significant and immediate performance improvement.
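To sketch the fix (simplified from what we actually run, with made-up sizes and a throwaway kernel), the snippet below splits a raster into slices and gives each slice its own CUDA stream, so the transfer of one slice can overlap the kernel execution of another instead of leaving the GPU idle during every copy.  Pinned host memory is required for the copies to run asynchronously.

    #include <cuda_runtime.h>

    // Throwaway kernel standing in for a real Map Algebra operation.
    __global__ void addScalar(float* data, float value, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += value;
    }

    int main()
    {
        const int n = 1 << 22;           // total raster cells (made-up size)
        const int chunks = 4;            // process the raster in four slices
        const int chunkN = n / chunks;

        float* hostData;                 // pinned host memory, required for async copies
        cudaMallocHost((void**)&hostData, n * sizeof(float));

        float* devData;
        cudaMalloc((void**)&devData, n * sizeof(float));

        cudaStream_t streams[chunks];
        for (int c = 0; c < chunks; ++c) cudaStreamCreate(&streams[c]);

        // Each stream copies its slice in, runs the kernel, and copies the result back.
        // While one stream is transferring over the bus, another can be computing,
        // so the GPU spends far less time sitting idle.
        for (int c = 0; c < chunks; ++c) {
            float* hostChunk = hostData + c * chunkN;
            float* devChunk  = devData  + c * chunkN;
            size_t bytes     = chunkN * sizeof(float);

            cudaMemcpyAsync(devChunk, hostChunk, bytes, cudaMemcpyHostToDevice, streams[c]);
            addScalar<<<(chunkN + 255) / 256, 256, 0, streams[c]>>>(devChunk, 10.0f, chunkN);
            cudaMemcpyAsync(hostChunk, devChunk, bytes, cudaMemcpyDeviceToHost, streams[c]);
        }
        cudaDeviceSynchronize();

        for (int c = 0; c < chunks; ++c) cudaStreamDestroy(streams[c]);
        cudaFree(devData);
        cudaFreeHost(hostData);
        return 0;
    }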

Coalescence, Occupancy, Idling

To summarize, the best way to optimize your GPU computations is to investigate and optimize these three steps (and in this order):

  1. Memory coalescence
  2. Thread group occupancy
  3. GPU Idling

There are a number of smaller optimizations that can be done as well, but we’ve found these to be the big 3.  Of course, you can continue this process forever, and demonstrate to your boss the law of diminishing returns.

GPU Memory Bandwidth and Coalescing

When one begins to work with GPGPU, the parallel processing gains can be substantial, if you know how to work with coalesced memory. This fits in with a parallel algorithm approach, incorporating the following:

  1. thinking about your computation in a data-parallel fashion.
  2. transferring working data into a local memory cache.
  3. scrutinizing how your code performs global memory accesses.

The first item almost goes without saying.  If you are hoping to leverage a massively parallel computing device, you obviously have to break your problem or computation down into discrete units that can be operated on in parallel.

It’s the second and third point that I am going to focus on in this post, since they are the most important factors when optimizing your GPGPU code.  The reason these are the most important factors is that local memory is so much faster at reading and writing than global memory, and the memory module in modern GPUs can perform concurrent reads to sequential global memory positions for an entire thread group.

Local Memory Caching

Use of a local memory cache may seem counter-intuitive to a programmer coming from CPU land.  The best analogy would be: storing your working data in RAM instead of on disk.  While not a perfect analogy, a CPU programmer understands perfectly the ramifications of such a design decision — any data accessed from disk will be retrieved more slowly than data accessed from RAM.  Likewise for local and global memory.  Local memory is on-chip memory that is exceptionally fast.  Global memory is off-chip memory that is often used to transfer data to/from the host (often the CPU).  I’m talking about a 100x speed difference when using local memory instead of global memory.
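To make that concrete, here is a minimal sketch (not our production Map Algebra code) of a focal mean kernel in CUDA.  Each thread block copies a 16×16 tile of the raster from global memory into fast on-chip shared memory (CUDA’s name for the local memory described above), synchronizes, and then computes a 3×3 average entirely from the cached tile.  It assumes the kernel is launched with 16×16 thread blocks and that the raster dimensions are multiples of 16; a real kernel would also load a one-cell halo so the cells on tile borders could be computed.

    // Each block caches one 16x16 tile of the raster in on-chip shared memory.
    __global__ void focalMeanTile(const float* raster, float* out, int width)
    {
        __shared__ float tile[16][16];

        int x = blockIdx.x * 16 + threadIdx.x;
        int y = blockIdx.y * 16 + threadIdx.y;

        // One coalesced global read per thread fills the cache.
        tile[threadIdx.y][threadIdx.x] = raster[y * width + x];
        __syncthreads();

        // Interior threads average their 3x3 neighborhood from shared memory,
        // so each cell is fetched from global memory once even though it is
        // read by up to nine threads.
        if (threadIdx.x > 0 && threadIdx.x < 15 && threadIdx.y > 0 && threadIdx.y < 15) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += tile[threadIdx.y + dy][threadIdx.x + dx];
            out[y * width + x] = sum / 9.0f;
        }
    }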

In addition to the differences in global and local memory, the memory bandwidth to/from the graphics card (which contains its own memory and processors) and the motherboard (which contains RAM and one or more CPUs) is another bottleneck.  Data transfer rates across the PCI Express 2.0 bus are about 8 GB/s.  Data transfer rates in the graphics card are around 141 GB/s.  So not only is the place in which you store your working data important, but also when and how you transfer that data to/from the GPU device itself.

Sequential Global Memory a.k.a. Coalescence

And “sequential global memory positions”? What is that?  Inside a GPGPU kernel, when accessing a portion of global memory, all threads in that group (NVidia calls them ‘warps’, and ATI calls them ‘wavefronts’) access a bank of memory at one time.  For example, if there are 16 threads executing with the same kernel, 16 sequential positions in global memory (1 position per thread) can be accessed in the same time that it would take 1 thread to read 1 position in memory.  If all memory accesses are performed this way, performance can speed up by a factor of 16 (in the memory access code).
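To illustrate the difference (a contrived sketch, not taken from our raster code), the first CUDA kernel below reads global memory in the coalesced pattern described above, with neighboring threads touching neighboring cells; the second forces neighboring threads to touch cells far apart, so each thread’s read becomes its own memory transaction.

    // Coalesced: thread i reads element i, so a thread group reads one
    // contiguous run of global memory in a single transaction.
    __global__ void coalescedRead(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;
    }

    // Strided: neighboring threads read cells 'stride' elements apart, so the
    // hardware must issue a separate transaction per thread.
    __global__ void stridedRead(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[(i * stride) % n] * 2.0f;   // assumes i * stride stays within int range
    }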

That’s a wonderful way to speed up data-intensive operations, especially when one is working with raster data, and a given block of cells is accessed multiple times.  It is in this scenario that our research has recently landed us.

Another thing worth noting is that the coalescence concept applies to global memory on the GPU only — local memory does not suffer the same performance hit, so it does not need this technique.  But global memory access on the GPU takes roughly 100x as long as local memory access.  This means that if you have coalesced global memory access, you are saving the equivalent of hundreds of instructions per thread.  This starts to add up when you consider that processing a raster may require hundreds or thousands of threads.

Armed with this knowledge, parallel algorithm implementations begin to have similar structures with regards to memory access.  The resulting code can be highly complex, though, and it’s not trivial to debug, but some new tools from NVidia and ATI are enabling developers to profile and visualize the work performed by the GPU. In my next post, I’ll discuss latency and occupancy, two metrics that one can use to help optimize GPU kernels.

GPUs and Parallel Computing Architectures

I’ve been blogging about GPUs recently, and I think you can tell it’s because I’m excited about the technology.  General Purpose Computing on the GPU (GPGPU) promises great performance increases in computationally heavy software, which we find immensely useful.  In the past, we’ve managed to engineer web-based applications (see: SmartConservation) that could run complex models by implementing a process queuing architecture, but in these systems processing may still take several minutes, so they can neither provide a responsive user experience nor support many users.  We’ve also engineered a system that can perform fast, distributed raster calculations (see: Walkshed, powered by DecisionTree).

One of the reasons that GPGPU is so promising is the increasing number of processing cores available on affordable graphics cards.  This increases the computation capacity by leveraging many processors running in parallel. What’s interesting is that this technique is not new.  Timothy Mattson, blogging at Intel, has been doing this since the mid-1980s.  The Library of Congress contains a book on parallel computing structures and algorithms, dating back to 1969.

As we delve deeper into our work improving Map Algebra operations, important differences in algorithm approaches and implementations become apparent: not all parallel architectures are the same.  One might be tempted to think that when switching from the single-threaded CPU logic to multithreaded/parallel logic that there would be one model of parallel computing that is universal.  This is definitely not the case.

Three of the most popular types of parallel computing today are:

  • Shared-memory Multi-Processors (SMP)
  • Distributed-memory Massive Parallel Processors (MPP)
  • Cluster computing

Each type of parallel computing has its benefits and drawbacks.  It really just depends on what kind of computing you need to do. I’ll describe these common computing types in detail, starting with the ‘traditional’ CPU model.


GPU Computing for GIS

We live in exciting times.

Computing power continues to grow at an exponential rate, and is well characterized by Moore’s Law (if you are looking for a graph more recent than 1965, try Wikipedia).  This means that computing power is moving in many directions.  The rise of laptops, notebooks, tablets, and smartphones is a testament to the increasing computing power of microprocessors.  They are getting faster, smaller, lighter, more power efficient, and sprouting more cores.

Despite this accelerating computing power, on some of our projects we’ve seen that many heavy-duty analytical computing tasks remain too costly (in terms of computing time) to be run on the web with more than a small number of users.  However, by distributing the computation across multiple processors and machines, we have found it is possible to improve both the scalability and speed of some geographic data processing tasks.  For one such task, a weighted raster overlay operation, we have been able to accelerate the process enough to make a scalable web application possible: Azavea’s DecisionTree framework, developed with support from an SBIR grant from the US Department of Agriculture.

With this experience developing distributed geoprocessing algorithms, we have recently been taking a look at technologies that will enable us to make similar types of performance and scalability improvements.  One technology that we believe has great promise for bringing these processes to the web is General Purpose Computing on the Graphics Processing Unit (GPGPU).

GPGPU leverages the microprocessors that power many modern graphics cards.  NVidia and ATI are the largest players in the high performance video adapter field, and they both have GPU computing libraries that run on their video adapter hardware.

GPUs are accelerating everything.

GPUs are powerful for general purpose computing not just because of their clock speed, but because there are just so many multiprocessors on today’s GPU graphics cards.  While a quad-core CPU is a high-end processor for most servers, today’s high-end graphics cards from NVidia and ATI have hundreds of cores (100, 200, 500 or more) and are capable of hundreds of gigaFLOPS of double-precision processing power.  And these numbers are doing nothing but going up.

A few ways of comparing just what that means:

  • a handheld calculator runs at about 10 FLOPS (not giga-, just plain FLOPS, one billionth of a gigaFLOP).
  • by the time you blink your eye, 154 gigaFLOPS have occurred on the NVidia Tesla C2070.
  • by the time a hummingbird flaps its wings, 10.3 gigaFLOPS have occurred on the same card.
  • by the time one floating-point operation has occurred on the same card, your voice has only traveled through about 0.64 nm of air (a human hair, by comparison, is 17-181 μm thick).

In addition to processors and processing speed, GPU cards have fast, specialized memory access.  They have a limited amount of local memory, but if you can figure out a way to use it efficiently, your memory access is on the order of 100x faster than conventional memory.

The combination of more processors and faster memory means that if you can discretize or parallelize the type of work that you want to perform, you can get radical speed improvements.

GIS on the GPU.

That’s all well and good, but how can GPGPU be used for GIS?  We are not the only ones thinking about this, but the answer depends on what kind of analysis you want to do.  We have been focusing our research on a few types of Map Algebra operations, and our preliminary investigations have shown that all types of Map Algebra operations can benefit from processing on the GPU (a minimal per-cell sketch appears after the list below).  In addition, we believe substantial improvements can be made in some types of vector processing; a few likely candidates:

  • Vector-to-raster and raster-to-vector conversion
  • Network analysis
  • Network routing
  • Transformations of geometric collections
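On the raster side, a local Map Algebra operation maps very naturally to the GPU.  As a minimal sketch (hypothetical layers and weights, not DecisionTree code), a weighted overlay can use one thread per cell, with each thread reading one cell from each input raster and writing one cell of the result, so every read and write is naturally coalesced.

    // Weighted overlay as a local Map Algebra operation: one thread per cell.
    __global__ void weightedOverlay(const float* slope, const float* landCover,
                                    const float* proximity, float* suitability,
                                    int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Hypothetical weights; a real model would take these as parameters.
            suitability[i] = 0.5f * slope[i]
                           + 0.3f * landCover[i]
                           + 0.2f * proximity[i];
        }
    }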

All of these optimizations have the potential to reduce the computing time for heavy-duty GIS operations from hours to minutes, and from minutes to seconds.  With that kind of speedup, the “attention threshold” of the web can be met.  It now becomes possible to run more complex GIS tasks in a web environment, bringing more computing power to the masses.

These changes won’t change the world right away, but they will make GIS analysis more interactive, responsive, and efficient.  Just imagine if you could complete any given task in your day in 1/10th the time (think Dash, from The Incredibles).