Late last week we experienced the third significant AWS outage in the past 14 months, as reported by Justin Lee in his piece "Heroku, Pinterest Among Sites Knocked Offline in Amazon Data Center Outage" on theWHIR:
An Amazon data
center in Ashburn, Virginia suffered a power outage at 9:45 p.m. PDT on
Thursday, causing some websites using AWS cloud technology to go
offline. High-profile websites like Heroku, Pinterest, Quora and HootSuite
saw downtime, as well as many smaller sites.
In this post I'd like to briefly address the lessons from this experience and, more importantly, focus on what this sort of failure means for public PaaS offerings, such as Heroku.
One of the promises of PaaS is productivity. PaaS providers like Heroku claim to increase productivity by abstracting away the details of the underlying infrastructure, and some go so far as to claim that PaaS makes application operations unnecessary.
Lesson 1: Choose the Right PaaS for the Job
The main lesson from this outage is that relying on the PaaS provider to handle all of your operations isn't always a safe bet. When we move to PaaS, we still need to understand how the provider runs its disaster recovery, high availability, and scaling procedures. A Heroku-like PaaS also forces you into a lowest-common-denominator approach to continuous availability and scalability. In reality, however, there are many tradeoffs between scalability, performance, and high availability. The best balance between those tradeoffs tends to be application-specific, so settling for a lowest common denominator can be less productive and more costly at the end of the day.
Which brings me to the last point in this section -- PaaS was meant to deliver higher productivity for running our apps on the cloud by abstracting the details of how we run our application (the operations) away from the application developer. The black-box approach of many public PaaS offerings, such as Heroku, takes this to an extreme. In practice, there is often close coupling between what an application does and the way we run it.
A new class of open PaaS platforms such as Cloudify, CloudFoundry, and OpenShift offers an open source alternative that gives you more control over the underlying PaaS platform. Cloudify takes this even further, providing an open recipe model that integrates with the likes of Chef, enabling you to easily customize and control your operations without hurting developer productivity.
Lesson 2: Database Availability Must Address Data Center Failure
The other area that Heroku, and to be honest most other PaaS offerings, don't adequately address is database high availability, which is admittedly a tough problem -- specifically in the event of a data center or availability zone failure, as in the present case. To deal with database availability, it is necessary to ensure real-time synchronization of the database across sites. The example at the bottom of this post shows one specific way this can be done, between two MySQL instances running on Amazon and Rackspace with Cloudify.
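To make "real-time synchronization across sites" concrete, here is a minimal sketch (not from the original post; the hostnames, credentials, and heartbeat table are placeholders) of a heartbeat check that verifies the cross-site copy is actually keeping up, using the pymysql driver:

```python
# Heartbeat-style replication check: write a timestamp on the primary,
# then confirm it arrives on the remote replica within a tolerated lag.
# Hostnames, credentials, and the table name are illustrative placeholders.
import time
import pymysql

PRIMARY = dict(host="mysql.aws.example.com", user="app", password="secret", database="petclinic")
REPLICA = dict(host="mysql.rackspace.example.com", user="app", password="secret", database="petclinic")

def write_heartbeat():
    conn = pymysql.connect(**PRIMARY)
    try:
        with conn.cursor() as cur:
            cur.execute("REPLACE INTO heartbeat (id, ts) VALUES (1, %s)", (time.time(),))
        conn.commit()
    finally:
        conn.close()

def replication_lag_seconds():
    conn = pymysql.connect(**REPLICA)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT ts FROM heartbeat WHERE id = 1")
            row = cur.fetchone()
        return time.time() - float(row[0]) if row else float("inf")
    finally:
        conn.close()

if __name__ == "__main__":
    write_heartbeat()
    time.sleep(5)  # give the replication channel a moment to propagate
    print("cross-site replication lag: %.1fs" % replication_lag_seconds())
```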
Lesson 3: Coping with Failure, Avoiding a Single Point of Failure
The general lesson from this and previous failures is actually not new, and to be fair, it is not specific to AWS or to any cloud service. Failures are inevitable, and they often happen when and where we least expect them. Instead of trying to prevent failure from happening, we should design our systems to cope with it.
The method of dealing with failures is also not that new -- use redundancy, avoid any single point of failure (including a data center, or even a data center provider), and automate the fail-over process.
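As a rough illustration of what "automate the fail-over process" means, here is a minimal sketch; the URLs are placeholders, and switch_traffic_to_standby() stands in for whatever DNS or global load balancer API you actually use:

```python
# Minimal automated fail-over loop: health-check the primary site and,
# after a few consecutive failures, route traffic to the standby.
# URLs and switch_traffic_to_standby() are illustrative placeholders;
# a real deployment would call a DNS / global load balancer API here.
import time
import urllib.request

PRIMARY_URL = "http://primary.aws.example.com/health"
FAILURE_THRESHOLD = 3   # consecutive failed checks before failing over
CHECK_INTERVAL = 10     # seconds between checks

def primary_healthy() -> bool:
    try:
        with urllib.request.urlopen(PRIMARY_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def switch_traffic_to_standby():
    # Placeholder: update the DNS record or load-balancer pool
    # so that clients are routed to the backup site.
    print("failing over: routing traffic to standby site")

failures = 0
while True:
    failures = 0 if primary_healthy() else failures + 1
    if failures >= FAILURE_THRESHOLD:
        switch_traffic_to_standby()
        break
    time.sleep(CHECK_INTERVAL)
```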
Why Haven't We Learned from Past Lessons?
The question that comes out of this experience, in my opinion, is not necessarily how to deal with failures (those lessons are as old as the mainframe, or even older), but rather: why are we failing to implement the lessons? Assuming that the people running these systems are among the best in the industry makes this question even more interesting. Here is my take on it:
· We give up responsibility when we move to the cloud: When we move our operations to the cloud, we often assume that we're outsourcing our data center operation completely, including our disaster recovery procedures. The truth is that when we move to the cloud we're only outsourcing the infrastructure, not our operations, and the responsibility for how we use this infrastructure remains ours.
· Complexity: Current DR processes and tools were designed for a pre-cloud world and don't work well in a dynamic environment such as the cloud. Many of the tools provided by the cloud vendors (Amazon in this specific case) are still fairly complex to use.
Implementing Past Lessons in a Cloud World
The first point is easy -- we need to assume full responsibility for our applications' disaster recovery procedures in the cloud, just as if we were running our own data center. The hard part is that in the cloud we often have less visibility, control, and knowledge of the infrastructure, which affects our ability to protect our applications -- and each sub-component of our application -- from failure. On the other hand, the cloud enables us to easily spawn new instances in various data center locations, a.k.a. availability zones.
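With AWS, for example, spawning a replacement instance in another zone amounts to a single API call. A hedged sketch using boto3 (the AMI ID, instance type, and zone are placeholders):

```python
# Launch a replacement instance in a different availability zone.
# The AMI ID, instance type, and zone are illustrative placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-12345678",   # placeholder AMI of the application image
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1b"},  # a zone other than the failed one
)
print("spawned replacement instance:", response["Instances"][0]["InstanceId"])
```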
And so, most failures can be addressed by moving from the failed system to a completely different system, regardless of the root cause of the failure. The first lesson, therefore, is that in the cloud it is easier to implement a disaster recovery plan by moving our application traffic to a completely different redundant system in a snap than by trying to protect every component of our application from failure. If we're willing to tolerate a short window of downtime, we can even use an on-demand backup site rather than pay the constant cost and overhead of maintaining a hot backup site.
Which brings me to the next point: what do we need to build such a solution? A consistent redundant environment that is ready to take over in case of failure needs to include the following elements (a skeleton of the flow in code follows the list):
· Workload Migration: The ability to clone your application environment and configuration in a consistent way across sites, on demand.
· Data Synchronization: The ability to maintain a real-time copy of the data between the two sites.
· Network Connectivity: Enabling the flow of network traffic between the two sites.
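To tie the three elements together, here is a bare skeleton of the flow; the function bodies are placeholders that, in the example below, map to Cloudify (workload migration), a MySQL replication channel (data synchronization), and a VPN or load-balancer setup (network connectivity):

```python
# Skeleton of an on-demand disaster-recovery flow covering the three
# elements above. All function bodies are illustrative placeholders.

def migrate_workload(site: str) -> None:
    """Clone the application environment and configuration onto `site`."""
    print(f"bootstrapping application stack on {site}")

def synchronize_data(source_db: str, target_db: str) -> None:
    """Keep a real-time copy of the data flowing between the two sites."""
    print(f"streaming changes {source_db} -> {target_db}")

def connect_sites(primary: str, secondary: str) -> None:
    """Open the network path (VPN/tunnel) so traffic can flow between sites."""
    print(f"linking {primary} <-> {secondary}")

def prepare_backup_site(primary: str, secondary: str) -> None:
    migrate_workload(secondary)
    connect_sites(primary, secondary)
    synchronize_data(f"mysql@{primary}", f"mysql@{secondary}")

prepare_backup_site("aws-us-east", "rackspace-ord")
```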
Which leads to the second challenge: complexity. Here, I'll use the example of a simple web app to show how we can easily create two sites on demand. I'll even go so far as to set up this environment on two separate clouds, to show how we can achieve an even higher degree of redundancy by running our application across two different cloud providers.
A Step-by-Step Example: Fail-Over from AWS to Rackspace
In this example, we picked Amazon and Rackspace as the two target sites. The same solution would also work between two Amazon availability zones or data centers. We've also tried the same example with a combination of HP Cloud Services and a flavor of a private cloud.
The example demonstrates a very simple web application with a global load balancer (Rackspace) and a web application (Pet Clinic) using Tomcat as the web front end and MySQL as the database. On both ends we used GigaSpaces XAP transactional WAN replication as the replication channel between the two MySQL instances, and Cloudify to handle the workload migration between the sites.
The Goal: Fail-Over with No Change to the Target Application
The goals that we set for ourselves were seamlessness and no change to the target application or database. We achieved this by plugging the replication service into the existing MySQL instances. The replication service listened to the MySQL events and replicated every change to its peer MySQL instance. Cloudify enabled us to clone the same application in both Amazon and Rackspace while maintaining a consistent configuration setup, as well as consistent scaling and fail-over SLAs. Cloudify does this by abstracting all the information through portable recipe definitions: it wraps the application instances with its management and control services based on the definitions provided in the recipe. This enabled us to clone the environment, and even add elastic scaling, without changing the target application (Pet Clinic in this case).
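As a rough illustration of that listen-and-replicate pattern (the actual example uses GigaSpaces' WAN replication, not this code), here is a sketch using the python-mysql-replication library to tail the primary's binlog and replay inserts on the peer. Hosts and credentials are placeholders, and only inserts are handled to keep it short; a production channel would also need conflict resolution, ordering guarantees, and failure recovery:

```python
# Listen to the primary's binlog and replay row inserts on the peer.
# Hosts and credentials are illustrative placeholders.
import pymysql
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent

PRIMARY = {"host": "mysql.aws.example.com", "port": 3306, "user": "repl", "passwd": "secret"}
PEER = dict(host="mysql.rackspace.example.com", user="repl", password="secret", database="petclinic")

peer = pymysql.connect(**PEER)
stream = BinLogStreamReader(
    connection_settings=PRIMARY,
    server_id=100,                 # must be unique among replication clients
    only_events=[WriteRowsEvent],  # inserts only, to keep the sketch short
    blocking=True,
)

for event in stream:
    for row in event.rows:
        values = row["values"]  # dict of column name -> value
        cols = ", ".join(values)
        marks = ", ".join(["%s"] * len(values))
        sql = f"INSERT INTO {event.table} ({cols}) VALUES ({marks})"
        with peer.cursor() as cur:
            cur.execute(sql, list(values.values()))
        peer.commit()
```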
You can read the full details, including code references on GitHub, in Dotan Horvits' blog post "AWS Outage Thoughts on Disaster Recovery Policies."
