On Cascading Failures and Amazon's Elastic Block Store

This post is one in a series discussing storage architectures in the cloud. Read Network Storage in the Cloud: Delicious but Deadly and Magical Block Store: When Abstractions Fail Us for more insight.

Resilient, adjective, /riˈzilyənt/ "Able to withstand or recover quickly from difficult conditions".

You know what commonly causes patients with a cough to keep coughing? Coughing.

Nearly four years ago I wrote a post titled "Why EC2 isn't yet a platform for 'normal' web applications" and argued that the lack of block storage persistence was a feature of EC2: fine for things like batch compute over objects in S3, but likely difficult for people expecting to run then-state-of-the-art databases.

Their eventual solution was to provide what most people are familiar with: basically a LUN coming off of a centralized storage infrastructure. Thus the mount command comes back into use and one can start booting root (/) partitions from something other than S3. While there was an opportunity to kill centralized, SAN-like storage, it was not taken.

The laptop-like performance of API-accessible EBS volumes, coupled with highly variable latencies (an intractable problem for a black-box-on-a-black-box style of architecture), has led to a lot of hacks: automated deployments of massive mirrored RAID sets, "burning in" volumes and then tossing them out if they prove insufficient, and many, many others.
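
As an aside, the "burn in, then toss" hack is roughly this loop: attach a volume, hammer it with reads, and keep it only if the latency distribution looks sane. Here is a minimal sketch in Python; the device path, sample count, and latency budget are made up for illustration, and a real burn-in would use O_DIRECT or a tool like fio over much longer runs.

    # Hypothetical "burn in, then decide" check for a freshly attached volume.
    # The device path and thresholds are illustrative, not recommendations.
    import os
    import random
    import time

    DEVICE = "/dev/xvdf"       # hypothetical EBS attachment point
    SAMPLES = 2000             # random 4 KiB reads to time
    SPAN = 10 * 1024**3        # probe only the first 10 GiB
    BLOCK = 4096
    P99_BUDGET_MS = 20.0       # made-up latency budget

    def burn_in(device=DEVICE):
        latencies = []
        fd = os.open(device, os.O_RDONLY)   # note: page cache can flatter these
        try:                                # numbers; real runs use O_DIRECT/fio
            for _ in range(SAMPLES):
                offset = random.randrange(SPAN // BLOCK) * BLOCK
                start = time.monotonic()
                os.pread(fd, BLOCK, offset)
                latencies.append((time.monotonic() - start) * 1000.0)
        finally:
            os.close(fd)
        latencies.sort()
        p99 = latencies[int(len(latencies) * 0.99) - 1]
        return p99 <= P99_BUDGET_MS, p99

    if __name__ == "__main__":
        keep, p99 = burn_in()
        print(f"p99 = {p99:.1f} ms -> {'keep' if keep else 'toss and re-provision'}")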

Let's think through a failure mode here: network congestion starts making your block storage environment think it has lost mirrors, so resilvering begins; file systems that don't even know what they're actually sitting on start to groan in pain; your systems decide you've lost drives, so everything from the infrastructure service all the way down to the "automated provisioning-burning-in-tossing-out" scripts ramps up; programs start rebooting instances to fix the "problems," but those instances boot off of the same block storage environment.

You have a run on the bank. You have panic. Of kernels. Or of language VMs. You have a loss of trust, so you check and check and check and check, but the checking causes more problems.

You begin to learn the same lessons the internet learned: the initial lack of back-off in its networking protocols famously induced congestive collapse in 1986. The lessons of congestive collapse.

Congestive collapse (or congestion collapse) is a condition which a packet switched computer network can reach, when little or no useful communication is happening due to congestion. Congestion collapse generally occurs at choke points in the network, where the total incoming bandwidth to a node exceeds the outgoing bandwidth. Connection points between a local area network and a wide area network are the most likely choke points. A DSL modem is the most common small network example, with between 10 and 1000 Mbit/s of incoming bandwidth and at most 8 Mbit/s of outgoing bandwidth. When a network is in such a condition, it has settled (under overload) into a stable state where traffic demand is high but little useful throughput is available, there are high levels of packet delay and loss (caused by routers discarding packets because their output queues are too full), and general quality of service is extremely poor.

Except most don't have a packet-like view of the non-networking parts of their infrastructure: they have a black-box machine view on top of a black-box block view. Not only do you have turtle upon turtle, the turtles aren't even aware of themselves or of each other.
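
The networking lesson still applies at those layers, though: anything that retries (health checks, remounts, reboots) needs to back off rather than hammer a struggling service in lockstep with thousands of other tenants. Here is a minimal sketch of exponential backoff with full jitter; the probe function, base delay, and cap are illustrative placeholders, not any particular AWS or Joyent tool.

    # Illustrative retry loop with exponential backoff and full jitter.
    # check_volume() stands in for whatever health probe your tooling runs;
    # the base delay, cap, and attempt limit are made-up numbers.
    import random
    import time

    def check_volume() -> bool:
        """Placeholder health probe; replace with a real mount/IO check."""
        raise NotImplementedError

    def wait_until_healthy(base=1.0, cap=300.0, max_attempts=20):
        for attempt in range(max_attempts):
            try:
                if check_volume():
                    return True
            except Exception:
                pass  # treat a probe error the same as a failed check
            # Full jitter: sleep a random amount up to the exponential ceiling,
            # so a fleet of instances does not retry in lockstep.
            ceiling = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))
        return False  # give up and page a human instead of rebooting in a loop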

In this type of system, this is the failure mode, and it will only get worse as the system gets larger, older, and less trusted. Less trusted by the users, less trusted by the system itself. The most pernicious bit is that this lack of trust is being built into the software that automatically manages these systems, most notably EBS itself, which seems to have aided this collapse through its aggressive re-mirroring.

This is not a "speed bump" or a "cloud failure" or "growing pains"; this is a foreseeable consequence of fundamental architectural decisions made by Amazon. This is also not The Big One; it's a foreshock. Keep in mind that AWS now dwarfs the infrastructure behind amazon.com, and they're learning as they go.

Now that we're more than 24 hours into a failure of said system, I'd like to point out a couple of things. I'm not going to bother fleshing all of these out or making dissertation-like defenses of each quite yet; I'm simply going to put them out there for you to think about, because each one could be a chapter in a book.

  • While there are many strengths to network storage, its failure modes, particularly in a multi-tenant environment of unknown workloads, are not among them.
  • Multi-tenant block storage offered as a service over a network to arbitrary operating systems running arbitrary file systems is asking for a world of pain where one sits burning in a kiln of fire.
  • A 1 Gbps link at wire speed lets you move roughly 0.45 TB per hour. Running a block storage service over a shared 1 Gbps network that never runs at wire speed will always limit your recovery time (see the arithmetic after this list).
  • Recovering a LUN doesn't mean that the file system that is on it is okay. Having the file system be okay doesn't mean that the datastore is okay.
  • Resiliency: success is the absence of failures that kill you. There is no one particular thing you do that leads to your success.
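
To put numbers behind the bandwidth bullet above, the arithmetic is simple: 1 Gbps at wire speed is 125 MB/s, or roughly 0.45 TB per hour, and every bit of contention stretches recovery time proportionally. A quick sketch in Python; the utilization figures are illustrative, not measured EBS numbers.

    # Back-of-the-envelope recovery math for a 1 Gbps link.
    LINK_BYTES_PER_SEC = 1e9 / 8   # 1 Gbps at wire speed = 125 MB/s

    def hours_to_copy(terabytes, utilization):
        """Hours to move `terabytes` at a given fraction of a 1 Gbps link."""
        return terabytes * 1e12 / (LINK_BYTES_PER_SEC * utilization) / 3600

    for util in (1.0, 0.5, 0.2):
        print(f"re-mirror 1 TB at {util:>4.0%} of 1 Gbps: {hours_to_copy(1, util):5.1f} hours")
    # re-mirror 1 TB at 100% of 1 Gbps:   2.2 hours
    # re-mirror 1 TB at  50% of 1 Gbps:   4.4 hours
    # re-mirror 1 TB at  20% of 1 Gbps:  11.1 hours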

With those thoughts, my concern now moves to those featured in case studies (via jeffbarr).

Back Online - Disaster Recovery. As most of you noticed, we were offline for about 18 hours. This was due to a major failure by our web host, Amazon Web Services. This issue knocked offline thousands of sites on Thursday. As of 5AM on Friday the outage is still on going, now over 24 hours later. We are back online now, but unfortunately had to go to our database backups to recover. This means that we lost about a day's worth of data: results, meet entries, forum posts, etc. Anything from about 7AM on 4/20 through 6AM on 4/21 will need to be re-entered. Thank you for your understanding and patience. We are very sorry for the problem. Let us know of any issues - support@milesplit.com

Over the coming days we will see many cases where data was actually lost, and yes, while it will be "their fault," I'm expecting to see a Magnolia or two.

Visit http://www.joyentcloud.com/ to learn about how Joyent's public cloud, JoyentCloud, is different from Amazon's AWS.

Photo by Flickr user loco085.



Post written by jason