Tag Archives: Amazon Web Services

Getting an ArcGIS Server Map Cache in S3

When deciding how best to handle the air photos in the new Philadelphia Water Department Stormwater Map Viewer, we kicked around a few ideas. We decided to put the cache in Amazon’s Simple Storage Service to offload some of the local disk requirements and leverage their fast data storage and delivery infrastructure. In the process of moving it, we learned a few things:

Tune Your Cache

Make sure you spend time planning the cache. Not only will the cache look better in the final application, but it will also load to S3 faster and cost less in the long run.

  • Set the extents in the MXD or MSD before publishing to a map service. Transferring the 254-byte empty tiles put a lot of unnecessary burden on the upload process, and you also pay to store them in the cloud. If it doesn’t need to be there, don’t build it.
  • Choose the correct image format for the cache. If you are caching a base map and do not need to support transparency, make it a JPEG. If it needs to support background transparency, use PNG. ESRI’s suggestions for planning a map cache can be found here.
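
To get a feel for how many blank tiles a cache holds before uploading it, a size check like the one below can help. This is only a sketch: it assumes the blank tiles in your cache are all exactly 254 bytes, as ours were, and the function name and cache path are made up for illustration. A real tile could match that size by coincidence, so spot-check a few matches before excluding anything.

```python
import os

EMPTY_TILE_BYTES = 254  # blank-tile size seen in our cache; verify it for yours


def find_empty_tiles(cache_root, empty_size=EMPTY_TILE_BYTES):
    """Yield paths of tile files whose size equals the blank-tile size."""
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == empty_size:
                yield path
```

Calling `sum(1 for _ in find_empty_tiles("path/to/cache"))` gives a quick count of the tiles that could be left out of the upload.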

Get a Good Tool to Transfer the Files

I started out using the free version of CloudBerry Lab’s S3 Explorer, but I had to move over 90 GB of data to my S3 bucket. CloudBerry S3 Explorer Pro supports multithreading, which allows up to 5 threads to enumerate the folders, copy the files, or apply the ACL. It is a low-cost application that more than pays for itself when moving a lot of files up to a bucket.

When transferring the files up, I worked in blocks of directories rather than a whole scale level at a time. It was quicker for me to work in batches of 20 to 30 subdirectories than to grab a whole scale level. It did require a little more management on my end, but progress was steadier.
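
If you would rather script the transfer than drive it from a GUI, the same batching idea can be sketched as follows. This is not how CloudBerry works internally; it is a rough illustration that assumes you supply your own `upload_dir` function (hypothetical here) to push one directory to the bucket. The batch size of 25 and the 5 threads simply mirror the numbers above.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def upload_in_batches(subdirs, upload_dir, batch_size=25, threads=5):
    """Upload one batch of subdirectories at a time, `threads` at once.

    `upload_dir` is a caller-supplied function that uploads a single
    directory and returns when it is done.
    """
    finished = []
    for batch in chunk(sorted(subdirs), batch_size):
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # list() forces every upload in the batch to finish (or raise)
            # before the next batch starts.
            list(pool.map(upload_dir, batch))
        finished.extend(batch)
    return finished
```

Because each batch completes before the next one starts, a failure leaves you knowing exactly which block of directories to retry.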

Accessing the Tiles

ArcGIS Server does not support cloud-hosted caches at the 9.3.1 release, so you’ll have to roll your own. The ESRI JavaScript API and Flex API can be extended to use caches hosted in the cloud (Flex example from Mansour Raad). For the Philly Stormwater project we were using OpenLayers, and someone had already rolled one for us: there is a patch that lets the client-side library access the cache directly, without communicating through ArcGIS Server. The one thing to note is that the tile origin is pretty touchy; we had to make some adjustments to the origin values to make sure everything lined up correctly.
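
To show why the origin matters so much, here is a minimal sketch of how a client can turn a map coordinate into a tile URL. It is not the OpenLayers patch itself. It assumes the exploded cache folder layout as we understand it (a two-digit level folder, then row and column written as eight-digit hex), and the bucket URL, origin, and resolution values are placeholders that you would take from your own cache configuration.

```python
import math


def tile_url(base_url, level, x, y, origin_x, origin_y, resolution,
             tile_size=256, ext="jpg"):
    """Return the URL of the tile covering map point (x, y) at `level`.

    Rows count downward from the origin's y and columns count rightward
    from its x, so a small error in the origin shifts every tile.
    """
    tile_span = resolution * tile_size  # map units covered by one tile edge
    col = int(math.floor((x - origin_x) / tile_span))
    row = int(math.floor((origin_y - y) / tile_span))
    # Folder naming: L<2-digit level>/R<8-digit hex row>/C<8-digit hex col>
    return f"{base_url}/L{level:02d}/R{row:08x}/C{col:08x}.{ext}"
```

Every row and column index is derived from the origin, which is why a slightly wrong origin value makes the whole layer line up badly rather than just one tile.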

Summary

Now that the site is up and we are starting to get some traffic, it is clear that putting the tiles in S3 was the right decision. There is no reason for ArcGIS Server to waste any cycles moving tiles around; let it do the heavy lifting with the vector layers and queries. Hopefully the rumors are true, and the ArcGIS Server 10 release will be more aligned with cloud computing. Until then, there are still plenty of ways to take advantage of the benefits.

Scaling Walkshed.org with Varnish and Amazon Web Services

We’re excited for voting to open today for our entry into the NYC Big Apps Contest, Walkshed NYC.

Walkshed is very CPU intensive since we generate heatmaps for users’ custom walkability factors on the fly.  Building on the work we did with using Amazon’s content delivery network for RedistrictingTheNation.com, we decided to expand our use of Amazon Web Services (AWS) for Walkshed as well as incorporate technology from the open source Varnish project.

Varnish for hardening (and an easier life)

Varnish is an HTTP accelerator that runs on Linux (and other Unix-style OSes). We experimented with Varnish to meet a few goals:

  • Caching frequently requested files and heatmap tiles (i.e. the default walkability heatmap tiles)
  • Scaling by letting Varnish load balance between multiple servers
  • Improving reliability by allowing Varnish to resubmit failed requests and monitor server health

By pointing Walkshed.org directly to Varnish, we are able to adjust server configurations on the fly without bringing down our application. Currently, Varnish provides load balancing between 4 server instances, which generate tiles using Walkshed’s DecisionTree engine. About 50% of the HTTP requests running through Varnish are cache hits, which keeps unnecessary traffic from clogging up our application servers.

One instance is hosted on our private server and is often able to meet demand, but adding 3 High-CPU Extra Large Instances from Amazon lets us improve fault tolerance and handle larger bursts in traffic.  Varnish also monitors the health of our servers and removes them from the cluster if they become unresponsive.
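
The behaviour described above (rotate across servers, skip the unhealthy ones, resubmit a failed request) can be pictured with a small toy model. To be clear, this is not Varnish code or configuration, and Varnish's real directors and probes are more involved. The class, method names, and backend labels below are invented purely to illustrate the idea.

```python
class Director:
    """Toy round-robin pool that skips backends marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark(self, backend, is_healthy):
        """Record the result of a health probe for one backend."""
        if is_healthy:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def pick(self):
        """Return the next healthy backend in rotation."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")

    def fetch(self, request, send, retries=1):
        """Send a request, resubmitting to the next backend on failure.

        `send(backend, request)` is a caller-supplied function that
        raises IOError when the backend fails.
        """
        for _attempt in range(retries + 1):
            backend = self.pick()
            try:
                return send(backend, request)
            except IOError:
                continue  # try the next backend in the rotation
        raise RuntimeError("request failed on every attempt")
```

In this model a backend that fails its probe drops out of the rotation until a later probe marks it healthy again, which mirrors the behaviour we rely on when an instance becomes unresponsive.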

Amazon EC2 Instances (bigger is better)

Our Amazon instances are using the new EBS-based images to improve boot speed.   We’ve found that it takes about 7 minutes from when we launch an instance until it is successfully added to our Varnish pool, which certainly isn’t bad.   By combining Varnish with Amazon’s on-demand resources, we should theoretically be able to scale as much as necessary.  For this demo application, scaling is a manual process, but we are looking toward a future where the cluster would scale automatically based on demand.

We also experimented with a few EC2 instance sizes. Since our application is CPU intensive, we found we had to go with the High-CPU Extra Large Instance to get decent performance. The instances still don’t match the performance we get on our private VMware-based server, and our hunch is that this is due to layers of virtualization causing memory allocation to be slow.

Technologies Used:

Bracing for (Potential) Catastrophic Success — Amazon’s Cloudfront CDN

Most of the web applications we build are either used internally by our clients or have a steady stream of public user activity.   With our recent Redistricting the Nation launch we wanted to experiment with some optimizations to make our site more resilient to traffic spikes as well as to improve the user experience.

Our strategy is broken down into a few components:

This post covers the Cloudfront CDN.

Previously, we had experimented with Amazon’s Web Services stack to host applications, but we hadn’t experimented with their Cloudfront CDN product.   Pricing for the CDN is quite similar to Amazon S3 and allows organizations to build scalable applications without the upfront cost of most CDNs.  We decided to use the CDN to host some large Javascript assets as well as our image components.

Cloudfront is quite easy to set up. We simply created an Amazon S3 bucket called s3.azavea.com and pointed a CNAME record for s3.azavea.com to the full bucket domain, s3.azavea.com.s3.amazonaws.com. Then, we enabled a Cloudfront distribution for the s3.azavea.com bucket using the free tool Cloudberry. Finally, we set up a CNAME record for cdn.azavea.com pointing to the Cloudfront distribution domain d17ib0dlm1q8qa.cloudfront.net and we were rolling.

Since the CDN is heavily cached, it was easiest to use s3.azavea.com links during development to reduce the amount of file versioning that was necessary.   Once we were settled on our assets, we switched to cdn.azavea.com links and started using the CDN.
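
One way to make that switch painless is to build every asset link through a single helper, so that moving from the bucket to the CDN is a one-flag change. The sketch below is only an illustration: the function name and the `use_cdn` flag are our own invention here, and the two host names are the ones described above.

```python
DEV_HOST = "s3.azavea.com"   # bucket served directly, used during development
CDN_HOST = "cdn.azavea.com"  # CNAME to the Cloudfront distribution


def asset_url(path, use_cdn):
    """Build an absolute URL for a static asset on the chosen host."""
    host = CDN_HOST if use_cdn else DEV_HOST
    return f"http://{host}/{path.lstrip('/')}"
```

With this in place, templates call `asset_url("js/app.js", use_cdn=False)` while assets are still changing and flip the flag once they settle.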

The speed of the CDN is quite astounding. Splitting assets across another domain name also improves the browser’s ability to request more files at once, which improves the user experience. We were quite pleased with how easily we could offload assets to Cloudfront and realize gains with limited time investment.

A few notes to keep in mind when you are working with a CDN for the first time:

  • Since there is no way to flush assets out of Cloudfront’s edge nodes, be sure to use file name versioning. This was a bit alien to us, but is easy to incorporate once you think it through. One exception: we decided not to set a far-future expiration header on our PDF assets, since they are often linked to directly and we wanted to be able to update them regularly.
  • Speaking of PDFs, it seems that while Cloudfront supports byte-range requests for assets, it doesn’t assert the “Accept-Ranges: bytes” HTTP header. This makes our large PDFs fully download before Adobe displays them within the browser.  Unfortunately there is no way to add this header at the moment.
  • Cloudberry is great for adding HTTP headers to S3 assets. We decided that most of our assets would have a six month cache lifespan by asserting the “Cache-Control: max-age=15552000” header.
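
The two habits in these notes, versioned file names and a long max-age, are easy to script. The sketch below is one possible approach rather than a description of our build: putting a content hash in the file name is an assumption (a release number would work just as well), and both function names are hypothetical. The max-age arithmetic reproduces the six month value quoted above, taking six months as 180 days.

```python
import hashlib
import os

SECONDS_PER_DAY = 24 * 60 * 60


def max_age_header(days):
    """Return a Cache-Control value that lets caches keep a file for `days`."""
    return f"max-age={days * SECONDS_PER_DAY}"


def versioned_name(filename, content):
    """Insert a short content hash before the extension: app.js -> app.<hash>.js"""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, ext = os.path.splitext(filename)
    return f"{stem}.{digest}{ext}"
```

Here `max_age_header(180)` returns `max-age=15552000`, and because the name changes whenever the file's bytes change, an updated asset gets a fresh URL instead of waiting for the old copy to expire at the edge.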