If you have been watching the news lately, you have seen the damage caused by a series of violent storms in the Northeast. In Virginia, where Amazon Web Services (AWS) hosts its East region datacenters, the main utility provider reported service interruptions for over 900,000 customers due to 80 mph winds and storm damage. A major pro golf event was forced to block fans from attending due to the large number of trees that had fallen the previous day.
“No one was to be allowed on the AT&T National course Saturday in Bethesda, Md., outside of players, security, tournament workers and media. Players’ families weren’t even allowed on.” Source: ESPN
Amazon reported disruption to some of its services in one Availability Zone of the East region. Many reporters and bloggers falsely reported that AWS lost power to its datacenter, prompting numerous posts questioning the stability of public clouds. The truth is that Amazon’s backup power sources kicked in, but not all compute resources failed over successfully. The impact was that a subset of virtual servers was knocked offline until AWS was able to restore them. How a customer handles that scenario determines whether their applications go down or stay resilient. Many sites went down. Inmar did not miss a single transaction due to the AWS outage.
Keeping the lights on during disasters
How do we keep avoiding downtime? We expect every server and every service in our platform to fail at some point, and we design for ways to continue processing transactions on redundant compute resources in multiple zones. In other words, we expect zones within regions to fail and design our platform not to depend on any single zone.
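The idea can be sketched in a few lines. This is a minimal illustration, not our production code: the zone names, the `ZoneDown` exception, and the simulated failed zone are all hypothetical stand-ins for redundant app servers in each Availability Zone.

```python
import random

class ZoneDown(Exception):
    """Raised when a compute zone cannot process a request."""

# Hypothetical zone endpoints; a real deployment would hold
# connections to redundant app servers in each Availability Zone.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]
FAILED_ZONES = {"us-east-1a"}  # simulate one zone knocked offline

def process_in_zone(zone, transaction):
    """Pretend to hand the transaction to app servers in one zone."""
    if zone in FAILED_ZONES:
        raise ZoneDown(zone)
    return f"{transaction} processed in {zone}"

def process(transaction):
    """Try zones in random order; succeed if any one zone is up."""
    for zone in random.sample(ZONES, len(ZONES)):
        try:
            return process_in_zone(zone, transaction)
        except ZoneDown:
            continue  # fail over to the next zone
    raise RuntimeError("all zones down; transaction lost")
```

With `us-east-1a` offline, `process("txn-42")` still completes in one of the surviving zones. The design only goes fully dark if every zone in the region fails at once.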
It takes more than redundancy across zones
Some very popular web sites suffered outages due to the power issues in the Northeast. Many of these same sites did design for redundancy across zones. So why were they down? I cannot speak for these companies, and I am sure they will blog post-mortems explaining what happened and how they will prevent it in the future. What I can explain is how we survived. We have been deployed on AWS since 2009. We have seen five AWS outages and have never been down as a result of any of them. With every outage, we carefully watch what our peers report and learn from what other companies did right and did wrong.

One pattern I have noticed is that AWS’s RDS service, which automates database administration, seems to go down whenever Amazon has issues. We have always been huge fans of RDS, and it remains on our roadmap to evaluate, but I have always felt it was too risky for our uptime requirements. We will continue to monitor RDS, but the fact that we manually manage our MySQL databases is one of the many reasons we have stayed resilient during these outages. Had we been reliant on RDS, we may not have been so lucky. Does that mean AWS customers should not use RDS? No. We may still use it for features that do not require extremely high Service Level Agreements (SLAs) or real-time connectivity to point-of-sale systems.
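Because we manage MySQL ourselves rather than delegating to RDS, failover stays under our control. The concept can be sketched as follows; the class and zone names are hypothetical, and a real promotion also involves checking replication lag, fencing the old primary, and repointing application connection strings.

```python
class ManagedMySQL:
    """Toy model of manually managed MySQL with cross-zone replicas.

    Hypothetical sketch only: tracks which zones are healthy and
    promotes a replica when the primary's zone goes down.
    """

    def __init__(self, primary_zone, replica_zones):
        self.primary = primary_zone
        self.replicas = list(replica_zones)
        self.healthy = {primary_zone, *replica_zones}

    def zone_failed(self, zone):
        """Record that a zone (and any database in it) is offline."""
        self.healthy.discard(zone)

    def ensure_primary(self):
        """If the primary's zone is down, promote a healthy replica."""
        if self.primary in self.healthy:
            return self.primary
        for zone in self.replicas:
            if zone in self.healthy:
                self.replicas.remove(zone)
                self.primary = zone  # promote replica to primary
                return self.primary
        raise RuntimeError("no healthy replica to promote")
```

Losing the primary's zone then means promoting a replica in another zone rather than waiting for a managed service to recover on Amazon's timetable.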
Is the public cloud too unstable?
Absolutely not! Amazon has done a tremendous job providing us with infrastructure on demand across numerous zones and regions. It is up to us to design for zones and regions to fail. Amazon offers a 99.95% SLA for compute within each region. It has never had multiple zones down within a region at the same time, and never multiple regions down simultaneously. In essence, for customers architected across zones, Amazon has effectively provided 100% uptime for compute resources. It is up to us to architect systems that take advantage of multiple zones and regions. Our work is not done, but we are a long way down the road toward high availability and reliability.
Are the companies that had down sites bad at architecture?
No. In fact, some of the big names being singled out in articles have some of the most impressive and advanced architectures ever seen in high-scale environments. Architecting for uptime is all about risk management, priorities, and investments. A free social media site with no SLAs to meet may choose to invest more in scaling to millions of concurrent users and accept the risk of going down for an hour or two. It may be a better investment for them to handle surges in traffic than to focus on the rare event of an AWS outage. Nobody ever died because they could not post a picture to Facebook. On the other hand, a company like Inmar invests more heavily in disaster recovery because our retail customers expect every shopper’s transaction to work flawlessly. Our scale is much more predictable than a consumer-facing social media site’s, so our roadmap faces easier choices between scalability and reliability. If our site needed to handle a surge of 5M users on any given day, we might have invested more in scalability and a little less in disaster recovery. It is all about tradeoffs.
Don’t be an armchair quarterback
I have read a ton of articles, blog posts, and comments that are very negative about both cloud computing and the companies that were impacted. One commenter declared that every CTO and system administrator at a company whose site went down should be fired. It is easy to sit on a couch and throw stones without any facts. Netflix runs one of the most advanced solutions ever built in the public cloud. They have built an agent called Chaos Monkey that purposely forces parts of their architecture to fail in production so that they are always testing for failures. Yet they still went down. Should we fire those guys? Absolutely not! We should learn from them. They are the leaders in this space. We also need to understand that many companies in the Virginia area that built their own datacenters were down too, and some still are. Power outages happen. Datacenters fail, both in the cloud and on-premise. Everything fails eventually. The secret to uptime is how you design for those failures.
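The Chaos Monkey idea of deliberately killing pieces of your own system to prove the rest survives can be illustrated with a toy harness. The names below are hypothetical; Netflix's real tool terminates live AWS instances, not Python lists.

```python
import random

def chaos_test(nodes, serve, kills=1, seed=0):
    """Terminate up to `kills` random nodes (always leaving at least
    one), then verify the service still answers from the survivors."""
    rng = random.Random(seed)
    survivors = list(nodes)
    for _ in range(min(kills, len(survivors) - 1)):
        victim = rng.choice(survivors)
        survivors.remove(victim)  # simulate terminating an instance
    return serve(survivors)

def serve(survivors):
    """A resilient service: succeeds as long as any node survives."""
    if not survivors:
        raise RuntimeError("total outage")
    return f"served by {survivors[0]}"
```

Running this continuously in production, as Netflix does with real instances, turns failure from a rare surprise into a routine condition the architecture is proven against every day.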
To learn more about Inmar’s digital promotion network that uses cloud computing to redeem coupons at the point-of-sale, visit our website to see how digital is done right.