avatarArticles by David Zwarg

CUDA, Stream, and OpenCL

Computing on the GPU, or GPGPU, is a steadily maturing technology.  There are many technologies out in the wild that will enable you to use GPU’s for computation, but there’s a catch: the vendors are still vying for the lead.  The two market leaders are currently NVidia and AMD/ATI.

That means that NVidia is pushing their GPGPU API, which is named Compute Unified Device Architecture,” or CUDA.  Their rival, AMD/ATI, is pushing Stream. Stream incorporates BrookGPU, a compiler and data-parallel language developed at Stanford University, which predates CUDA.

NVidia & CUDA, ATI & Stream, or OpenCL

NVidia & CUDA, ATI & Stream, or OpenCL

Both of these vendor APIs are proprietary, and run on each vendor’s specific hardware.  This makes sense if a developer can control what hardware computations will be using. Realistically, a developer rarely has such control. So what are the options? At the current time, there are only a couple:  OpenCL and Microsoft’s DirectCompute technology.  Microsoft’s technology is limited to Windows Vista and Windows 7, though, so we are focusing on OpenCL.

OpenCL is the Open Computing Language, a language that extends the C99 standard (a modern dialect of the C programming language) and compiles into device-specific binaries. OpenCL was originally developed by Apple, and handed over to the Khronos Group. The OpenCL standard was ratified by the consortium in December of 2008.  The Khronos Group consortium includes all the major players in the field, including NVidia, AMD/ATI, Apple, and Intel.  The list is much more extensive, but those are the four to be happy about.  Intel doesn’t support OpenCL in their multicore CPUs, but I’m optimistic that they will release an OpenCL API to leverage CPU cores as well as GPU cores as computing devices.

OpenCL was created to address the need for speed in current desktop systems that contain GPU processors.  The language was created to address computing on heterogeneous systems, which, when you think about it, can include many other types of computing devices.  If OpenCL is adopted by Android, then you could optimize code to run on Android devices, too. While this may not be the fastest approach, it would potentially let you distribute work among devices.

One caveat to heterogeneous systems, though: OpenCL kernels that are written and optimized for one hardware platform probably won’t perform the same as on another hardware platform.  While OpenCL enables developers to write code that can run on multiple hardware devices, the hardware implementations may vary.  For example, the number of processor cores, and thus the number of parallel threads may vary widely.

If you can’t tell already, we are sold on the promise of OpenCL for GPGPU.  The language is easy to use (if you already know C), and it supports the two biggest players in the GPU market, NVidia and AMD/ATI.  We are hoping that Intel releases their OpenCL drivers for CPUs, too, so that we can squeeze out the last drip of computing power for our computations.

What the heck is … GPGPU?

General Purpose Computing on the Graphics Processing Unit, or GPGPU, is a technology that enables software to use the computing power of the multiple processing units that come with modern graphics cards.  The benefits of using these processing units is that computational work gets done by many workers at once, instead of one CPU or a few threads on a multi-core CPU.

At first, it may sound like multi-threading, and in a way it is.  What’s key, though, is that the GPU doesn’t just multi-thread, it’s synchonized, lock stepped multi-threading.   It’s called Single Instruction, Multiple Data, or SIMD in Flynn’s Taxonomy (NVidia’s hardware uses a variant called Single Instruction, Multiple Threads, or SIMT).  This refers to the control flow in each thread on the GPU — to maximize performance, it’s important to get all threads running through the same logic at the same time.  In practice, some threads will eventually branch, and keeping the branching to a minimum is critical.

When we have processing tasks that can be highly discrete, the SIMT nature of the GPU enables computations to really fly when using GPGPU.  Structuring computations in this fashion will result in operations that are highly parallelized.  Coincidentally, they will also be easily distributed to other computation engines.  But unlike other distributed computing methods, GPGPU is tightly coupled, and has very low latency — you get immediate results.

OpenCL Logo

To facilitate all of above, GPGPU requires a video card that supports programmability.  While the major GPU manufacturers have released proprietary GPGPU languages — for example is NVidia’s CUDA framework — a cross-platform, device-independent alternative, called OpenCL, has recently become sufficiently robust that we can begin to use it for implementing GPGPU processing.  While originally developed by Apple, OpenCL is now overseen by a non-profit organization, the Khronos Group.

If you’ve got a GPU on hand, it’s quite likely that there is an OpenCL API available in your programming language of choice.  Here’s a short list from the Khronos Group’s website:

By the way, Khronos just released the spec for OpenCL 1.1 on June 14th, 2010.

GPGPU for GIS

Our interest in GPGPU is motivated by our desire to speed up complicated raster math calculations, in particular those that would enable us to make more sophisticated geoprocessing available on web and mobile platforms.  Our vision is to make an order of magnitude or better improvements in geospatial data processing and thereby make web-based analysis tools more responsive and compelling.  To support our research we applied for and were awarded a National Science Foundation (NSF) Small Business Innovation Research (SBIR) grant to benchmark some traditional GIS operations against  GPU-accelerated versions of the same operations.  Our objective is to improve these operations by about 20x, and it looks promising.

We’re sampling the wide array of Map Algebra operations that are in standard GIS toolkits, focusing on a couple examples of each of the following types of operations: local, focal, zonal, and global.  Some of these types of operations are easily parallelized, such as local and focal operations.  The other operations require a fair amount of algorithm wrangling before they show any improvement on the GPU’s parallel architecture.

Our research is ongoing, so it’s still a bit too early to tell, but we’re excited to already see some impressive performance improvements.

GPGPU Lessons

Migrating our chosen Map Algebra operations to the GPU was trivial in some cases, but quite challenging in others.  For example, local Map Algebra operations are quite easy to parallelize.  Each raster cell is compared to another raster cell in the same location.  This requires very little extra memory and few intermediate steps.

Focal (or “neighborhood”) operations are not simple, particularly for large neighborhoods, but with some intelligent partitioning, it’s relatively easy to work with different sections of the rasters.  This requires some intermediate buffers, but performs well using local memory on the GPU.

Zonal operations summarize the cells in one raster data set using zones identified in a second one.  These operations require a couple scans across a raster, and a few intermediate stages — that one was complicated.

What’s crazy difficult, though, are the global operations.  These operations are challenging because they operate on the entire dataset, with one cell on one side of the raster potentially affecting a cell on the other side of the raster. Examples of global operations include Euclidean distance, cost-weighted distance, viewshed analysis and others.  These operations are non-trivial to convert to GPU processing in a performant way, but it is definitely possible.

GPU Computing for GIS

We live in exciting times.

Computing power continues to grow at an exponential rate, and is well characterized by Moore’s Law (if you are looking for a graph more recent than 1965, try Wikipedia).  This means that computing power is moving in many directions.  The rise of laptops, notebooks, tablets, and smartphones are a testament to the increasing computing power of microprocessors.  They are getting faster, smaller, lighter, more power efficient, and sprouting more cores.

Despite this accelerating computing power, however, on some of our projects, we’ve seen how many heavy-duty analytical computing tasks remain too costly (in terms of computing time) to be run on the web with more than a small number of users.  However, by distributing the computation across multiple processors and machines, we have found it is possible to improve both the scalability and speed of some geographic data processing tasks.  For one such task, a weighted raster overlay operationg, we have been able to accelerate the process enough to make a scalable web application possible.  Azavea’s DecisionTree framework, developed with support from an SBIR grant from the US Department of Agriculture.

With this experience developing distributing geoprocessing algorithms, we have recently been taking a look at technologies that will enable us to make similar types of performance and scalability improvements.  One technology that we believe has great promise for bringing these processes to the web is General Purpose Computing on the Graphics Processing Unit (GPGPU).

GPGPU leverages the microprocessors that power many modern graphics cards.  NVidia and ATI are the largest players in the high performance video adapter field, and they both have GPU computing libraries that run on their video adapter hardware.

GPU’s are accelerating everything.

GPU’s are powerful for general purpose computing not just because of their clock speed, but because there are just so many multiprocessors on today’s GPU graphics cards.  While a quad-core CPU is a high-end processor for most servers, today’s high-end graphics cards have 100, 200 and 500 or more cores and are capable of gigaFLOPS double precision processing power (NVidia, ATI, respectively).  And these numbers are doing nothing but going up.

A few ways of comparing just what that means:

  • a handheld calculator runs at about 10 FLOPS (not giga-, just plain FLOPS, one billionth of a gigaFLOP).
  • by the time you blink your eye, 154 gigaFLOPS have occurred on the NVidia Tesla C2070.
  • by the time a hummingbird flaps it’s wings, 10.3 gigaFLOPS have occurred on the same card.
  • by the time one FLOP has occurred on the same card, your voice has only traveled through 0.64 μm of air (human hair ranges from 17-181 μm thick)

In addition to processors and processing speed, GPU cards have fast, specialized memory access.  They have a limited amount of local memory, but if you can figue out a way to use it efficiently, your memory access is on the order of 100x faster than conventional memory.

The combination of more processors and faster memory means that if you can discretize or parallelize the type of work that you want to perform, you can get radical speed improvements.

GIS on the GPU.

That’s all well and good, but how can GPGPU be used for GIS?  We are not the only ones thinking about this, but the answer depends on what kind of analysis you want to do.  We have been focusing our research on a few types of MapAlgebra operations, and our preliminary investigations have shown that all types of MapAlgebra operations can benefit from processing on the GPU.  In addition, we believe substantial improvements can be made in some types of vector processing with a few likely candidates would be:

  • Vector-to-raster and raster-to-vector conversion
  • Network analysis
  • Network routing
  • Transformations of geometric collections

All of these optimizations have the potential of reducing the computing time for heavy duty GIS operations from hours to minutes, and therefore minutes to seconds.  With that kind of speedup, the “attention threshold” of the web can be achieved.  It now becomes possible to run more complex GIS tasks in a web environment, bringing more computing power to the masses.

These changes won’t change the world right away, but it will make GIS analysis more interactive, responsive, and efficient.  Just imagine if you could complete any given task in your day in 1/10th the time (think Dash, from the Incredibles).

Azavea’s Coffee Helper: Caféduino

Azavea has a clearly defined symbiosis with coffee. We have a designated Minister of Coffee, an incredible coffee grinder, and we get selections of coffee from around the world; some of it hand-delivered, and some of it hand-crafted.

One of the problems opportunities that I observed was that as coffee was brewed and consumed in the office (about 6 brews a day), there would often be an unlucky staffer who picked up the (opaque) coffee pot to find it empty.  I don’t personally drink coffee, so I was unaware of how much anticipation one would have when approaching the pot.  Not knowing how much coffee was in there, and worrying if the mug you pour may be half empty or half full.

I slowly set to work in my free time to solve this conundrum.  To me, it seemed like an individual would want to know if there was coffee in the coffee pot before approaching the coffee maker (whom some of us address reverently as Zojirushi-san).  This could be 1) a web page, 2) a desktop app, or 3) an IRC (not IIRC) bot.

I drew up some schematics, took some measurements, and retreated to my home lab to build an Arduino based, web-enabled measurement system tailored for the coffee pot. I used an Arduino Diecimila, an Ethernet Shield, a couple piezoelectric sensors, a 3 color LED, a couple buttons, and an awesome hand-crafted wooden base.

A short while later, I had a working prototype ready for testing. This device now sits in Azavea’s kitchen, measuring the weight of the coffee maker, and reporting the measurements to pachube.  After doing some internal evaluation, the name ‘Caféduino’ stuck, and I developed a couple methods of viewing the coffee pot status.

  1. Direct web access
    The web page generated by the Cafeduino

    The web page generated by the Caféduino

    Using this method, it’s possible to directly address the Caféduino.  This gives one direct access to the measurement values, but is more useful for other applications that are polling the data frequently.

  2. Caféduino Notification
    The system tray notification app.

    The system tray notification app.

    Using this method, the Caféduino is polled continuously, and the tiny coffee mug in the system tray is updated as the coffee level changes.  This is the most aggressive method of monitoring the Caféduino which, mysteriously, is the most comforting for users.

    Visualizing the Cafeduino history.

    Visualizing the Caféduino history.

    When the coffee mug is clicked, the history of the Caféduino is charted in the window.  What you are seeing is a Google visualization applet that is consuming the historical data, stored on http://www.pachube.com/.

  3. IRC bot integration
    A sample IRC conversation with the IRC bot.

    A sample IRC conversation with the IRC bot.

    Lastly, the most interactive method of polling the Caféduino is through our internal IRC channel. The above screenshot is the conversation that I initiated with the IRC bot, and its response.  It has reassured me that there is indeed, 77.78% of a pot of coffee left.

Now our staff can check in on the coffee pot, to insure that their next visit to the kitchen will be without disappointment.  While this system works well for monitoring the coffee level, the next steps may be more involved – building a machine to automatically brew coffee.

Azavea Enters MassDOT Developers Real Time Challenge

We here at Azavea like playing with data. It’s a fact. All sorts of data (geographically aggregated, processed and hunched, walkable, etc) are useful, but not always accessible. So when MassDOT (the Massachusetts Department of Transportation) announced their Developers Real Time Challenge, we knew it was an opportunity that we didn’t want to pass up.

MassDOT has been making steady progress opening transportation data to developers over the past few months. In September of 2009, they first released 2 full days worth of passenger data, and challenged the developer community to play with it to develop a visualization and application that used that data. Then in November of 2009, they hosted a developer conference, announced the winners of their first challenge, and announced the release of a real-time XML feed. They kept that ball rolling in February 2010 with a Hackathon at MIT with the Center for Future Civic Media, and then the announcement of the Real Time Challenge.

BusMinder screenshot

Screenshot of BusMinder in action.

My response to the real time challenge: BusMinder (http://sandbox.azavea.com/mbta/).  BusMinder is an experiment, designed to enable users to create bus reminders (‘busminders’) for their favorite bus stop(s).  Users walk through the process of:

  1. Selecting a bus stop
  2. Selecting a reminder type (sms or email)
  3. Selecting a time window

It’s that simple to get started. The application will save those settings and remind the user when an MBTA vehicle is approaching that stop, and how far away it is in minutes.  Once a user registers, they can set up multiple reminders (commuting to work and commuting to home, for example).

I introduce BusMinder as an experiment because MassDOT has published this real time XML data feed on an ‘trial’ basis. Currently, only 5 bus lines are supported, and it’s understood that the data is experimental.  In fact, one of the bullet points for developers working with MassDOT data is “Expect change”.  My hope is that this application is found to be useful, and it helps MassDOT release more of their real time transit information.

I also want to give a shout out to Robert Cheetham, for Azavea’s generous Research program.  I was able to shift around my projects a bit and take on this challenge under the auspices of my independent research.  Thank you, Robert.

Ok, ok, enough already!  Go to http://sandbox.azavea.com/mbta/ and start playing!

http://civic.mit.edu/

Envisioning Development

This is so simple, it’s cool: http://envisioningdevelopment.net/map

I especially like the hourglass-like effect way of populating the columns. It gives one the feel of really counting things. Like when you switch between East Harlem and the Upper East Side.

I would like to be able to see the distribution over the whole city, or the gradients between neighborhoods, but that’s just me. I think the design is neat and clean, and tells a very compelling story.

Ignite: Spatial, Boston

I got the opportunity to present at Ignite: Spatial, Boston a couple weeks ago.  I was fortunate to present Sourcemap.org in the company of other Boston area techies doing some cool work in laser scanning, CityML, social media and more.

All the videos are on YouTube. The presentation summaries are also online in this Google Doc.

Enjoy your spatial ignition this morning.