The Four Keys of Cloud Security: Availability

This is the third in a series of blog posts on cloud security by Carlos Cardenas, our Director of Solutions Engineering. Carlos is a security expert who came to Joyent from the Institute for Cyber Security (ICS) at the University of Texas, San Antonio. While at ICS, Carlos worked under Ravi Sandhu, PhD, one of the leading security experts in the world.

In my first post, The Four Keys of Cloud Security, I introduced the four security issues that are important for Cloud Service Providers (CSPs) and their customers, and focused on confidentiality. In the previous post, I focused on integrity. In this post, I will focus on Key #3: Availability.

At the end of my previous post, I left off saying that without availability, it doesn't matter what kind of encryption algorithms or authentication methods are used. To better understand availability, let’s review a few events that occurred over the past two years in cloud computing.

Most recently, on Christmas Eve 2012, Netflix went offline. (As an avid Netflix customer, I was fuming that I couldn't watch A Charlie Brown Christmas, The Christmas Story or any of the "man" classics like Die Hard 1 and 2, Lethal Weapon 1 and 2, and Rocky IV.) This wasn't due to a Netflix failure specifically, but rather happened because ELB went down. The outage occurred when a portion of the ELB state data was deleted, which affected part of the ELB service for a prolonged period of time.

Earlier that year, on October 22, the gremlins took out Reddit, Foursquare, Pinterest and other social media sites. The culprit: EBS. This event highlights the fragile nature of using a SAN for your cloud.

The last event I'm going to highlight was on June 29, 2012 and involved "a severe storm of historic proportions." The storm affected AWS and caused widespread outages at Netflix and friends because generators did not transfer power effectively. It's worth noting that Joyent's US-EAST-1 is adjacent to AWS's US-EAST-1 data center and did not suffer any of these issues during the storm.

What's the moral of these Aesop's Fables?

First of all, do not rely on a single CSP for anything that is mission critical. (I believe Netflix falls into this category for all movie buffs.) Secondly, do not rely on a CSP's infrastructure for resiliency. Build this into your application. It's far easier to build your app to withstand failures than to keep pushing the problem down the stack. This is what we advocate at Joyent.

Let me give an example of what some people might think their CSP is doing to recover from a problem.

In the first case, the CSP is unsuccessful:

  • Detect a problem in the VM that might be performance related (hard to do, but let’s say it's possible)
  • Perform "Live Migration" of the VM to another server (uses SAN to store state)
  • SAN fails
  • Performance-related problem just got worse

Or if they are able to recover:

  • Detect imminent problem on server
  • Perform "Live Migration" of the VM to another server (uses SAN to store state)
  • Migration succeeds
  • Repeat

In reality, it’s easier than all that. All the customer needs to do is build recovery into the app itself, such that:

  • VM fails
  • Launch another instance of app

or

  • Server fails
  • Launch another instance of app
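
The "launch another instance" flow above can be sketched as a small reconciliation step that the application (or a supervisor it owns) runs on a timer. This is only a minimal sketch: the `reconcile` function and the `is_healthy` and `launch` callables are hypothetical names introduced for illustration, and a real health check and launch call would depend on your app and provider.

```python
from typing import Callable, List


def reconcile(
    instances: List[str],
    desired: int,
    is_healthy: Callable[[str], bool],
    launch: Callable[[], str],
) -> List[str]:
    """Return at least `desired` healthy instances, launching replacements as needed.

    `is_healthy` and `launch` are hypothetical hooks supplied by the caller.
    """
    # Drop anything that fails its health check. It does not matter whether
    # the VM or the underlying server failed: the response is the same.
    healthy = [name for name in instances if is_healthy(name)]

    # Launch fresh instances rather than trying to repair or migrate failed ones.
    while len(healthy) < desired:
        healthy.append(launch())

    return healthy
```

Because the step only replaces what is missing, it can be run repeatedly without side effects when nothing has failed, and it never depends on shared storage holding the state of the failed instance.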

Remember, in the cloud, you are essentially in a distributed computing environment where anything that can fail will fail. You, the user, must prepare for that and not rely on the provider's technology to prevent or repair the issues.

I think I have been beating this dead horse long enough, but the bottom line is: Do not rely on SAN technology for anything that requires robustness against failures, Byzantine or not.

What Can Be Done Today

  • If your site is mission critical in nature, use multiple providers. There are plenty of application frameworks, such as pkgcloud, that work across multiple CSPs.
  • Do not rely on your provider's infrastructure to provide fault tolerance. Build it into your applications.
  • Avoid SANs for distributed workloads; nothing good comes from a faulty SAN.
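
As a rough illustration of the first two points, here is a minimal sketch of launching on whichever provider is available. It does not use pkgcloud or any real provider SDK: each provider is represented by a hypothetical `launch` callable that you would implement against the provider of your choice, and `launch_on_first_available` is a name made up for this example.

```python
from typing import Callable, List, Tuple

# A provider is a (name, launch) pair. `launch` is a hypothetical callable
# that starts one instance and returns its ID, or raises on failure.
Provider = Tuple[str, Callable[[], str]]


def launch_on_first_available(providers: List[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, instance ID)."""
    errors = []
    for name, launch in providers:
        try:
            return name, launch()
        except Exception as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The order of the list expresses your preference, so the primary provider is used whenever it is up and the others only see traffic during an outage.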

In my next post, I'll talk about the holy grail of Cloud Security, Key #4: Mutual Auditability, and how close we are to obtaining it.



Post written by Carlos Cardenas