Accelerating the
Industrial Internet of Things

Analysis: Rethinking cloud architecture after the outage of Amazon Web Services

Published on 03/07/2017 | Technology

95 0

Brian Guy

Research Analyst . IoT ONE

Overview

Thanks to innovation from companies such as Amazon with AWS, Microsoft with Azure, and Google with Google Cloud Platform (GCP), organizations of all sizes are today increasingly more agile and competitive. Cloud provider partners like Dev9 enable organizations to optimize their journey to the Cloud.

But on Tuesday, February 28, 2017, many people found that their smart phone applications were no longer working properly, many web sites were down and the Internet in general just seemed broken. This is what happens when AWS, the largest Cloud provider, experiences a “service disruption.”

What makes this past week’s outage unique is that unlike prior outages, “service disruptions” or “service events” as Amazon calls them, this week’s web site outages and mobile application failures were not the result of organizations not following Amazon’s best practices, otherwise known as the “Well-Architected Framework.”

In prior AWS outages, such as the 2016 “Service Event in the Sydney Region” where an entire Availability Zone (AZ) failed, organizations that followed Amazon’s Well-Architected best practices were not negatively impacted. This 2017 outage will no doubt cause Amazon to reassess its Well-Architected Framework and introduce new best practices focused on S3 availability.

Indeed, even the AWS Service Health Dashboard (SHD) itself was impacted due to its dependency on S3 in a single region. Amazon has now re-architected its dashboard to be multi-region.

What Is the AWS Well-Architected Framework and Why Does It Matter?

The success of AWS is largely dependent on the success of its customers. If customers do not architect and implement optimally, it hurts the reputation of AWS. If customers have outages, poor performance, very high spend or security issues on AWS, this similarly hurts AWS.

To help its customers and by extension help itself, AWS introduced the Well-Architected Framework on October 2, 2015. The AWS Well-Architected Framework initially focused on four pillars:

•  Security

•  Reliability

•  Performance Efficiency

•  Cost Optimization

In November 2016, after a year of thousands of reviews carried out by AWS Solutions Architects, a fifth pillar was added:

•  Operational Excellence

For startups and smaller organizations that do not have a significant investment in on-premises hardware, deploying to the Cloud is often the default decision in 2017. The agility and flexibility of Cloud computing, combined with the lower startup costs, help these organizations spend less time and money on infrastructure and computing resources.

For large enterprises, however, deploying to the Cloud is a significant incremental cost until redundant on-premises resources are retired. This process can take years, resulting in higher costs in the near term over 3-5 years (or more).

Moreover, the paradigm shift from capital expenditures to higher operating expenses (CapEx to OpEx) can face internal political pressure that can slow down the entire process of migrating and modernizing legacy applications, while also innovating new applications in the Cloud.

Smaller, more nimble competitors instantly have access to the global infrastructure of AWS or Azure, thereby eliminating a prior significant barrier to entry.

These new competitors force enterprise customers to adopt the Cloud (despite it being an incremental cost in the near term) in order to innovate and obtain the agility required to effectively and efficiently compete in 2017. Amazon’s Well-Architected Framework helps these organizations set up for success in the Cloud.

 

What is Amazon S3?

Amazon Simple Storage Service (S3) is object storage. In modern computing, storage is typically divided into being file level storage, block storage or object storage.

File level storage is found on Network Attached Storage (NAS) and typically works in conjunction with a protocol such as SMB (think Windows shares) or NFS (popular in Unix and Linux environments). Amazon Elastic File System (EFS), which is similar to NFS, and Azure File Storage, which uses SMB, are examples of Cloud-based file storage.

Block storage is what you find in your PC or local storage on a server, and it usually – but not always – includes a file system (e.g., NTFS, FAT32, ext3, Btrfs) on top. Some database servers, such as Microsoft SQL Server and Oracle Database, are capable of writing directly (referred to as a RAW partition) without needing the overhead of the file system on top.

In AWS, local instance storage (also called ephemeral storage) and Elastic Block Store (EBS) are examples of block storage. In Azure, Premium Storage is an example of block storage on SSD. Storage area networks (SANs) also use block storage.

Object storage, unlike file level storage and block storage, does not need to be accessed via an operating system like Linux or Windows. It can be accessed directly via APIs or via http(s), making it optimal for web applications.

As storage costs dropped, as megapixels on cameras and phones continued to increase and as users started wanting to store and share gigabytes and even terabytes of large objects, object storage met a need that is not efficiently met by block storage or file level storage. While block storage is excellent for operating system files, relational database records and Office documents, it is not optimal for a feature-length HD movie (think Netflix’s needs).

Object storage allows Netflix to store its movies, allows photo sharing sites to store your photos, allows music streaming services to store their songs, allows iCloud to store a backup of your iPhone, allows video game publishers to store their games for download…and much more.

Amazon S3 is Amazon’s object store. In Azure, this service is referred to as Blob Storage (blob = Binary Large Object).

Why Did So Many Things Break When Just Object Storage Had a Problem?

AWS promotes a best practice of moving all web static content – such as images and style sheets – off of more expensive EC2 instances and on to S3. Amazon EC2 (Elastic Compute Cloud) is Amazon’s name for a virtual server. EC2 expense is frequently a large portion of an organization’s AWS spend, so offloading work from EC2 to S3 can be a best practice for cost optimization.

For example, a web site running on a fleet of EC2 instances might link to S3 for all of its images and other static content, and it might rely on EC2 itself only for dynamic content creation. This removes load from the EC2 instances, thereby potentially decreasing the number of EC2 instances needed and/or decreasing the size and specifications of the EC2 instances.

Similarly, if a content delivery network (CDN) such as CloudFront, Azure CDN or Akamai can pull static assets off of S3 instead of from EC2 instances, this reduces load on more expensive virtual servers.

In addition, for static web pages that only have client-side scripting and do not need server-side dynamic content, the entire page can be hosted on S3. In other words, S3 can even act as a simple web server, completely removing the need for any EC2 instances.

Lastly, many AWS services are dependent on S3. Therefore, when S3 is down, other AWS services may not work as expected. According to Amazon, “Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”

The US-EAST-1 (N. Virginia) region is one of the most heavily-used regions in the AWS global infrastructure, so the outage occurring in this region likely impacted a higher number of customers than if the outage had occurred in a different, smaller region.

Why Weren’t Some Sites and Applications Impacted?

It is misleading to state that AWS was down or that the Internet was down when in fact a single AWS service in a single region was experiencing a disruption.

This is a trend where the media might report the outage being larger than it is, primarily due to so many popular web sites and mobile applications being impacted. The fact is that an optimal multi-region architecture would not have been impacted by Tuesday’s S3 outage.

When a single Availability Zone (AZ) in the Sydney region experienced an outage in 2016, optimally architected sites and applications continued to run as designed. But popular sites and applications that were not designed optimally indeed triggered reports that all of Amazon – even that the Internet itself – was down in Australia.

In addition, sites and applications that use S3 in a different region were not impacted since the outage was isolated to a single region. And sites and applications that do not rely on S3 were not impacted, unless they relied on another AWS service that was impacted by the S3 outage.

What Happened? What Caused This Rare S3 Outage?

Amazon touts S3 has having an impressive 11 nines (99.999999999%) of durability, so how could this happen? And what does 11 nines really mean?

We first need to differentiate durability versus availability. Durability refers to the lack of data loss, whereas availability refers to the data being available.

Eleven nines of durability effectively means you are very unlikely to lose any data on S3 – even if you use S3 in a single region – if you choose the Standard storage class.

The Reduced Redundancy storage class decreases durability to 99.99%. It is assumed there was no data loss in the February 2017 S3 outage. Amazon disclosed that a small number of customers lost data from EBS volumes during the Sydney outage in 2016.

While durability refers to the risk of data loss, availability refers to the risk of an outage or the service not being available. S3, when used in a single region, is “designed for” 99.99% (4 nines) of availability and includes a service level agreement (SLA) for 99.9% (3 nines) of availability.

By comparison, EC2 provides an SLA of 99.95% (what I call 3½ nines) when deployed to at least two Availability Zones (AZs).

 

The full article can be viewed here.

Feature New Record
RELATED CASE STUDIES
RELATED GUIDES