April 24, 2011 - by joyentlindsay
This post is one in a series discussing storage architectures in the cloud. Read Network Storage in the Cloud: Delicious but Deadly and On Cascading Failures and Amazon’s Elastic Block Store for more insight.
Like anyone who has anything to do with the business of running applications attached to the Internet, I can't help but feel uneasy and stressed when an outage as severe and impacting as the one Amazon Elastic Block Store has just experienced happens. The fact that Amazon is a competitor doesn't matter. It just sucks for everyone. This post, however, isn't about EBS. Not really. It's about abstractions. In particular, the abstraction of "the disk."
Abstraction levels have fascinated me for my entire career. We (developers, operators, the whole lot of us) depend heavily on multiple abstraction layers in our respective stacks simply because the technology we're putting into play is so incredibly complex that there's no way we could progress without the ability to hide details.
The trick, of course, is getting the abstraction right. It has to be understandable, observable, and predictable. A fantastic implementation is worthless if the abstraction is wrong. Change the behaviour of the abstraction you're implementing too much, and the abstraction becomes worthless. You can tell where I'm going with this - questioning whether or not the abstraction of a reliable, super cheap, performant network attached block storage device is viable.
Certainly there are a lot of players in this space with a lot of skin in the game who want us to believe it is. I'm not talking about just Amazon - the top five storage vendors (EMC, HP, IBM, NetApp, and Dell) shipped 3397 petabytes of the stuff last quarter.
There's a phrase about abstractions (I honestly don't know if it was myself or someone else who coined it) that I remind our engineering staff about on a regular basis. It goes something like this:
We need our abstractions, our black boxes. Black boxes are great and enabling when they work. But when something goes wrong, the walls of our abstractions need to go transparent as fast as possible.
There's a reason we so fanatically drive at this goal at Joyent, and it's not because we're smarter or better than anyone else. It's because we've been burned by invalid abstractions of our own doing so many times that we've programmed a Jabber bot to constantly berate us with reminders of how our hubris around abstractions led us to failure.
The object store and multi-tenant message routing architecture resulted in a body of code and interfaces that, although open source from day one, nobody but us could really understand well enough to contribute to or debug. In fact, the object store itself tripped us up so many times while dogfooding the platform that we finally shook free of its hold and reevaluated our approach. The implementation had issues, but the real culprit was the set of abstractions that Smart Platform presented.
Today we're the driving force behind NodeJS, which sits on a very different place on the abstraction curve than Smart did. We've pulled lots of ideas and lessons from Smart into our hosted NodeJS offering, no.de, but we did drop as much of the magic as possible. The end result is a service that's tremendously simpler and in many ways less feature rich, but it doesn't make promises we can't fulfill and our customers can understand it.
In fact, when something goes wrong they can log right in with a shell and see exactly what's going on. We've paired the transparency with ground-breaking observability tools. NodeJS and no.de are huge successes, largely, I think, because we walked the fine line of abstraction level better this time. It's only a black box when you want it to be. When you need to look inside, you can. You can evaluate the risk level of the technology and approach yourself.
The promise of network block storage is wonderful: Take a familiar abstraction (the disk), sprinkle on some magic cloud pixie dust so that it's completely reliable, available over the same cheap network you're using for app traffic, map it to any instance in a datacenter regardless of network topology, make it so cheap it's practically free, and voila, we can have our cake and eat it too! It's the holy grail that many a storage vendor, most of them with decades of experience in storage systems and engineering teams thousands strong, has chased for a long, long time. The disk that never dies. The disk that's not a disk.
The reality, however, is that the disk has never been a great abstraction, and the long history of crappy implementations has meant that many behavioral workarounds have found their way far up the stack. The best case scenario is that a disk device breaks and it's immediately catastrophic, taking your entire operating system with it. Failure modes go downhill from there. Networks have their own set of special failure modes too. When you combine the two, and that disk you depend on is sitting on the far side of the network from where your operating system is, you get a combinatorial explosion of complexity.
To make matters far worse, you can't easily flip the switch and make the walls of this Magic Block Store go transparent so you can see what's happening under the covers. EBS is no different than the big storage vendors in this regard -- neither Amazon nor EMC seems to want us to know what goes on inside their abstraction.
I certainly don't claim to know how EBS works, but of course people go to bars and have beers and talk. It's commonly believed that EBS is built on DRBD with a dose of S3-derived replication logic. As a former boss of mine so demeaningly stated to me once, in grown-up land when your cheap home grown DRBD solution falls over and loses all your data you open up your check book and pay EMC or NetApp to do the job right. Maybe Amazon is different. Maybe they did what a dozen billion dollar companies before them tried to do and never pulled off. Or maybe EBS is indeed bandaids and chicken wire. I have no idea. Which is a problem, as a user of EBS.
Back to the abstraction instead of the implementation. Do we need block devices in the cloud? If we're just asking too much of this ancient abstraction, as I believe we are, can we as developers and operators stop demanding that the vendors make these Magic Block Store solutions? We certainly fear having to evolve our trusted abstractions, and the disk is about as old an abstraction as we've got.
All is not lost, however. In my experience, it isn't that we need "disks", it's that our apps need POSIX's read() and write() and friends combined with behaviour that roughly approximates a physical disk. For the times we actually need a block interface vs. filesystem semantics, I would argue, the abstraction sitting on top (the filesystem, usually) is even more sensitive to the behaviour of the disk device than a POSIX application like a database is.
My opinion is that the only reason the big enterprise storage vendors have gotten away with network block storage for the last decade is that they can afford to over-engineer the hell out of them and have the luxury of running enterprise workloads, which is a code phrase for "consolidated idle workloads." When the going gets tough in enterprise storage systems, you do capacity planning and make sure your hot apps are on dedicated spindles, controllers, and network ports.
It was fantasy to believe it was possible to pull off a centralized network block storage service in a multi-tenant cloud without any of the architecture shenanigans our enterprise brethren do, and to think that applications, databases, and businesses could depend on its being perfect. Honestly, we should have known better. We, the application developers, asked what is perhaps the crappiest of all abstractions in computers to solve all of our availability problems for us. We asked for magic. Clearly, the vendor never should have made the promise of magic, but everyone is to blame for this continued expectation that such magic is possible.
A Different Approach
At Joyent, when VMware and Amazon were preaching centralized network block storage as the salvation, we tacked the other way, for reasons I'll explain below. (Hint: We tried and failed.) We simply provide a POSIX filesystem interface to storage, which happens to be sitting on local, not network, block devices. We lean on ZFS for durability and a fighting chance when things "go byzantine." We use fast SAS drives in RAID groups with multiple parity stripes. We don't pretend that this "disk" can never, ever go away or survive all failure modes. What's more important, I believe, is the set of technologies we put in place so that our customers can transparently see exactly how our filesystems are holding up.
As an example, we provide tools and scripts that show you a distribution of how much latency your MySQL database is seeing from the abstraction directly beneath it, the filesystem.
You can run it yourself from the command line. The output looks like this:
    # ./mysqld_pid_fslatency.d -p `pgrep mysqld`
    Tracing PID 7357... Hit Ctrl-C to end.
    ^C
    MySQL filesystem I/O: 21329; latency (ns):

      read
               value  ------------- Distribution ------------- count
                1024 |                                         0
                2048 |@@@@@@@@@@@@                             3837
                4096 |@@@@@@@@@@@@@@@                          4998
                8192 |@@@@@@@@@@@                              3540
               16384 |@@                                       745
               32768 |                                         95
               65536 |                                         19
              131072 |                                         3
              262144 |                                         0
              524288 |                                         0
             1048576 |                                         1
             2097152 |                                         1
             4194304 |                                         0

      write
               value  ------------- Distribution ------------- count
                4096 |                                         0
                8192 |@@@@@@@@@                                1902
               16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@               5181
               32768 |@@@@                                     798
               65536 |@                                        184
              131072 |                                         9
              262144 |                                         1
              524288 |                                         6
             1048576 |                                         0
             2097152 |                                         0
             4194304 |                                         7
             8388608 |                                         2
            16777216 |                                         0
Or you can use our visualizations:
Either way, it's pretty awesome. Being able to confirm your DB is seeing thousands of cache hits with microsecond latency and only a few disk hits in the 1-10ms range is the kind of insight that lets DBAs sleep at night. You can also easily ask and receive an answer to a question like "how many of my slow queries are because of filesystem/disk latency vs. a missing index?" Trying to use a tool like iostat against a shared, network provided block device to figure out what level of service your database is getting from the filesystem below it is an exercise in frustration that will get you nowhere.
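To make the shape of that output concrete, here is a minimal Python sketch of the same idea: time each read() an application issues against the filesystem and count the latencies in power-of-two buckets, where each row covers values from its label up to (but not including) the next label. This is not the DTrace script shown above and it does not trace a running mysqld; the function names (pow2_bucket, quantize, time_reads, render) and the temporary test file are assumptions made purely for illustration.

```python
import os
import tempfile
import time
from collections import Counter


def pow2_bucket(ns: int) -> int:
    """Return the largest power of two <= ns (0 for non-positive values)."""
    return 1 << (ns.bit_length() - 1) if ns > 0 else 0


def quantize(latencies_ns):
    """Count latencies per power-of-two bucket."""
    return Counter(pow2_bucket(ns) for ns in latencies_ns)


def time_reads(path, block_size=4096):
    """Time every read() against a file; return the latencies in ns."""
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            start = time.perf_counter_ns()
            data = os.read(fd, block_size)
            latencies.append(time.perf_counter_ns() - start)
            if not data:  # empty read means end of file
                break
    finally:
        os.close(fd)
    return latencies


def render(hist, width=40):
    """Format the histogram as text rows: bucket, bar, count."""
    total = sum(hist.values()) or 1
    rows = []
    for bucket in sorted(hist):
        bar = "@" * round(width * hist[bucket] / total)
        rows.append(f"{bucket:>10} |{bar:<{width}} {hist[bucket]}")
    return "\n".join(rows)


if __name__ == "__main__":
    # Hypothetical workload: read back a 1 MiB scratch file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(1 << 20))
    try:
        print(render(quantize(time_reads(tmp.name))))
    finally:
        os.unlink(tmp.name)
```

Measuring at the read() call, from the application's side, is the point: it captures what the layer above the filesystem actually experiences, cache hits included, rather than what an individual device reports.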
For Joyent, keeping the network out of the storage solution and safely persisting those filesystems across reboots on the servers added some complexity to our provisioning. We don't have the luxury of treating local disk as ephemeral when selecting nodes to place workload on, for example. We lose some nice features like quick re-mounting to arbitrary VM instances, too. But it allowed us to present a usable, observable abstraction that we're continuing to improve with innovative I/O throttling and QoS. We're building a better building block, in our opinion. The complexity of tackling this work with an oversubscribed network in the middle would have been, in our opinion, an impossible engineering challenge.
Joyent, like always, arrived at the conclusion that we could not build a multi-tenant cloud on network block storage the hard way: We tried to build a multi-tenant cloud on network block storage. When we launched the Joyent IaaS product, it was in fact built on replicated storage nodes that presented iSCSI targets to the server hosts that then in turn implemented RAID1 mirrors of the network block devices. The hosts then carved up the LUNs into ZFS datasets which mapped to the virtual machines.
It's worth explicitly pointing out that we had none of the flexible provisioning APIs and features that EBS currently exposes. I'm certainly not pretending we had an EBS competitor back then. But on the backend, the implementation was probably not wildly different, and the core abstraction we attached our customers' virtual machines to was nearly identical. Our MBS (Magic Block Store) implementation was originally built on the StorEdge Availability Suite, a product that came out of StorageTek/Sun that works very similarly to DRBD. We were never able to get it to work well enough, largely because we did not have enough resources to pour into enhancing the base product to cope with a scenario we saw a lot: competing I/O from multiple hot tenants caused the replication to fall behind and never catch up. We instead invested in filers from NetApp and EqualLogic (we didn't want all our eggs in one basket), and that bought us time to take engineering and ops resources off keeping the storage nodes running and work on other parts of our stack.
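The "fall behind and never catch up" behaviour is simple arithmetic, and a small sketch makes it visible. The Python below is a toy model, not a description of how StorEdge Availability Suite or DRBD work internally: it assumes a fixed replication link capacity and tracks how much written data is still waiting to be replicated after each one-second interval. The function name and every number in it are hypothetical.

```python
def replication_backlog(write_mb_per_s, link_mb_per_s):
    """Return the un-replicated backlog (MB) after each one-second interval."""
    backlog = 0.0
    history = []
    for written in write_mb_per_s:
        # New writes add to the backlog; the link drains a fixed amount.
        backlog = max(0.0, backlog + written - link_mb_per_s)
        history.append(backlog)
    return history


# Hypothetical 100 MB/s replication link shared by all tenants.
LINK = 100

quiet = [60] * 5                 # combined tenant writes stay under the link
burst = [60, 150, 150, 60, 60]   # a short spike, then back to normal
sustained = [120] * 5            # several hot tenants at once, indefinitely

print(replication_backlog(quiet, LINK))      # [0.0, 0.0, 0.0, 0.0, 0.0]
print(replication_backlog(burst, LINK))      # [0.0, 50.0, 100.0, 60.0, 20.0]
print(replication_backlog(sustained, LINK))  # [20.0, 40.0, 60.0, 80.0, 100.0]
```

A burst drains once the load drops, but when the combined write rate stays above what the link can carry, the backlog grows without bound. No amount of waiting fixes that; only more capacity or less load does.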
Ultimately, however, the approach suffered the same fate as the homegrown solution, and the performance curve would degrade so badly over time that we were at times only attaching a handful of servers to a replicated filer pair. Obviously, this was not only uneconomical but unviable from a predictable performance point of view; our customers hated the I/O experience. After nearly 9 months of trying to make it work, we decided to begin the engineering effort to re-work our provisioning architecture, server characteristics, and so on to be able to reliably support persistent data pools on the servers themselves while keeping as many of the features commonly associated with MBS as we could (snapshots we have today, async replication isn't that far off).
I still believe that it was the right decision, and as we have grown up as a company and increased our engineering capabilities we're only now really beginning to see how liberating the decision to drop network storage out of the Joyent cloud was. I believe that applications can rise to the task and not require perfect, always on block storage, for the simple fact that they're going to have to do it anyways to get to the goal of global deliverability and availability. We just can't lean on the false abstraction of a perfect disk forever.
Visit http://www.joyentcloud.com/ to learn about how Joyent's public cloud, Joyent Cloud, is different from Amazon's AWS.
Photo by mcbarnicle.