April 25, 2011 - by jacksonwest
This post is one in a series discussing storage architectures in the cloud. Read Magical Block Store: When Abstractions Fail Us and On Cascading Failures and Amazon’s Elastic Block Store for more insight.
One of the persistent themes of my career has been something of a love affair with understanding systems failure -- I fervently believe that systems have the most to teach us when they fail. And while this interest in failure hasn't made my own any easier to stomach, my career has been blessed with plenty of opportunity for study: For an infrastructure provider, failures can be devastating, resulting in downtime, data loss and worse. And for a cloud provider, failures are not only just as devastating, but potentially much more widespread in their impact. As such, the massive, systemic failure in Amazon's US-EAST-1 region (and the prolonged downtime that it induced) provides an occasion to reflect on the nature of failure in the cloud. While it is true that some level of systems failure is endemic (and that systems must be designed with these failures in mind), this cannot become an excuse for unreliable systems. Failures must be understood if we are to collectively learn from them.
For Joyent, this has been an opportunity to reflect back on some of our own failures -- and specifically how they changed the way we think about the cloud. You may have seen Jason on cascading failure or Mark on the perils of abstraction. I wanted to add my own perspective on Joyent's experience with network storage.
I knew Jason and Mark (and Dave and Ben) long before I came to Joyent in August, and I remember when they were not so jaundiced about network storage: I recall having a discussion with Jason in 2006 when he excitedly talked of his plans for using block-based network storage as the bedrock of a large-scale IaaS offering. He was enthusiastic -- it all seemed so elegant and fool-proof! -- and as I was developing a network storage appliance at the time, so (naturally) was I. I do remember us both wondering aloud if having a copy-on-write filesystem backed by a LUN that was in fact a remote copy-on-write filesystem was going to have new performance pathologies, but like doomed hikers ignoring a distant storm cloud, we collectively shrugged our shoulders and we each pushed on down our respective paths...
Five years later, we're all older and wiser -- and we each independently learned some very tough lessons about the peril of network storage in a multi-tenant environment. Jason summarized that lesson euphemistically:
"While there are many strengths to network storage: Failure modes, particularly in a multi-tenant environment of unknown workloads, is not one of them."
Anyone who has stood anything up in this business knows that Jason's admonishment practically drips with the cold sweat of a production nightmare; Mark provides additional detail:
"After nearly 9 months of trying to make it work we decided to begin the engineering effort to re-work our provisioning architecture, server characteristics, and so on to be able to reliably support persistent data pools on the servers themselves while keeping as many of the features [...] as we could"
For me personally, however, the lesson came not from trying to operate a cloud, but rather from being one of the enterprise vendors that Mark refers to: The Fishworks appliances found quite a bit of traction in the cloud storage space, and we had several customers who enthusiastically hung eye-watering numbers of their cloud customers off of a single (clustered) head.
The problem was that the scale at which these boxes were operating -- tremendous amounts of data churn over many, many virtual machines -- meant that any hiccup would be felt across their entire fleet. And there were plenty of hiccups, often (but not always) due to operator error or operator misunderstanding.
As a concrete example, I found myself berated by one particularly angry cloud storage customer because the box was "misreporting" several thousand writes per second when the customer "knew" that any writes were impossible -- that the particular LUNs being identified were "never" written to. While I was able to prove to the customer that the writes were indeed happening (DTrace FTW), I couldn't take it any further: The network divided the symptoms of the problem from whatever was inducing them in their cloud, making it difficult to understand who was doing it or why.
Several days later, I happened to notice that the number of writes per second had dropped precipitously; I inquired if they had discovered the root cause, and the customer sheepishly admitted that there had been a "configuration error." Obviously not satisfied with that answer after the tongue-lashing I had received on this issue, I pressed for details -- and it was revealed that the file systems had all been mounted on the networked LUNs without "noatime" set. That is, any access to a file updated the access time -- the atime -- on the inode, turning a gentle read-only workload into a fire-breathing metadata-writing monster.
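To make the configuration error concrete, here is a minimal sketch of how one might audit for it. It assumes a Linux-style mount table (one mount per line: device, mount point, filesystem type, comma-separated options), which is an assumption of this example and not something the customer's environment is known to have used. The function name and the sample devices are hypothetical.

```python
def mounts_without_noatime(mounts_text):
    """Return (device, mount_point) pairs that are writable and lack 'noatime'.

    Assumes Linux-style mount table lines:
    device mount_point fstype options dump pass
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        opts = options.split(",")
        if "ro" in opts:
            continue  # a read-only mount cannot write access times
        if "noatime" not in opts:
            flagged.append((device, mount_point))
    return flagged


# Hypothetical sample data, for illustration only.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data ext4 rw,noatime 0 0
/dev/sdc1 /media iso9660 ro 0 0
"""

if __name__ == "__main__":
    for device, mount_point in mounts_without_noatime(SAMPLE):
        print(device, mount_point)
```

On a Linux guest, the same function could be fed the contents of /proc/mounts. Run across a fleet of virtual machines, a check like this would surface the problem from the compute side, which is exactly the visibility the storage appliance did not have.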
Now, this is a mistake that anyone could make (certainly), but that the storage system was distributed made it unnecessarily arduous to root-cause. Further, the performance ramifications of this were significantly more acute due to the scale of the system: Because they had consolidated so many virtual machines on so few (centralized) spindles, their operations per spindle were much higher than they would have been had the storage been local. And need I remind: These are disks that we're talking about here. They are still operating under the same principle as the IBM RAMAC circa 1956 -- and their seek times are still counted in the same time units! They do not like having load poured on them, and they punish such demands with queueing delays and non-linear latency. (And lest you feel like venting your frustration, they also don't like being shouted at.)
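The non-linear latency can be illustrated with the textbook M/M/1 queueing formula, in which mean response time is the service time divided by (1 - utilization). This is a deliberate simplification: real disks are not M/M/1 servers, and the 10 ms service time and the operation rates below are hypothetical numbers chosen for illustration, not measurements from the systems described here.

```python
def mean_latency_ms(service_ms, ops_per_sec):
    """Mean response time of an M/M/1 queue, in milliseconds.

    service_ms:  average time to service one operation (assumed value)
    ops_per_sec: offered load on a single spindle
    Returns infinity once the spindle is saturated.
    """
    utilization = ops_per_sec * service_ms / 1000.0
    if utilization >= 1.0:
        return float("inf")
    return service_ms / (1.0 - utilization)


if __name__ == "__main__":
    # Consolidating more load per spindle raises latency far faster than linearly.
    for ops in (10, 50, 80, 90, 95, 99):
        print(f"{ops:3d} ops/s per spindle -> {mean_latency_ms(10.0, ops):8.1f} ms")
```

Under these assumptions, going from 50 to 90 operations per second on a spindle does not double latency but multiplies it fivefold, which is why concentrating many tenants on a few centralized spindles makes a stray metadata-write workload so costly.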
This whole experience -- and many others like it -- left me questioning the value of network storage for cloud computing. Yes, having centralized storage allowed for certain things -- one could "magically" migrate a load from one compute node to another, for example -- but it seemed to me that these benefits were more than negated by the concentration of load and risk in a single unit (even one that is putatively highly available).
When I began to talk to Joyent, I was relieved to hear that their experiences so closely mirrored mine -- and they had made the decision to abandon the fantasia of network storage for local data (root file systems, databases, etc.), pushing that data instead back to the local spindles and focussing on making it reliable and available. Yes, this design decision was (and remains) a trade-off -- when local data is local, compute nodes are no longer stateless -- and Joyent has needed to invest in technologies to allow for replication, migration and backup between nodes.
But in the end, the loss of functionality has been more than made up for by the gain in resiliency. Now when we have I/O performance problems (and, let's be honest, where there's I/O there are I/O performance problems), we know that the failure is contained to a single node and a bounded number of tenants -- and we have rich tooling to diagnose the root cause (which is often up-stack).
Moreover, by keeping local I/O local, we have all of the necessary context to make decisions around I/O scheduling to assure Quality of Service -- which has enabled us to take a swing at multi-tenant I/O throttling. This is not to say that network storage doesn't have its uses of course -- but frankly, serving up virtual devices to a cloud just isn't one of them: When local storage is consolidated, risk is consolidated along with it, and component failure is more likely to balloon into systemic failure.
More generally, network storage embodies a key design issue in architecting and operating a cloud: The tension between economies of scale and overconsolidation of risk. The challenge of the cloud is delivering enough of the former to be viable without so much of the latter as to be deadly. We at Joyent have learned much of this the hard way, and we reflect that wisdom not only in the way we operate our public cloud, but also in our SmartDataCenter offering. We will no doubt continue to learn ways in which we may better strike that balance, but we take solace in all that we have already learned -- and hope that future education need not be forged in the kiln of unspeakable pain!
Visit http://www.joyentcloud.com/ to learn about how Joyent's public cloud, Joyent Cloud, is different from Amazon's AWS.
Photo by Michael Zimmer.