Cloud Redundancy: How Amazon Should Repair Credibility

Leonard Grace

Cloud Community Stunned

I'm puzzled, but not entirely surprised, that a company such as Amazon (NASDAQ: AMZN) allowed its servers to be interrupted for any length of time by severe storm damage in northern Virginia this past weekend. Companies using cloud servers both expect and depend on being able to pull information from cloud sources to operate their businesses without interruption. After all, IT professionals have long been preaching the security and reliability of the cloud for managing large data sets off-site. Amazon's steps to repair its credibility should be transparent and swift.

Redundancy Issues 101

Failing to address possible redundancy issues early on in cloud infrastructure is a basic (101-level) design and maintenance failure, and one that can become extremely expensive.

  • Did the backup generator design properly address power load requirements for a long-term outage?
  • Was regularly planned generator testing implemented for inclement weather or other emergencies?
  • An adequate and properly maintained generator is the first line of defense in an outage and should be highest on the maintenance list.
  • What surge-protection plans were implemented for grid spikes, which can disrupt and even destroy electronic equipment or interfere with telecommunication operations?

This issue, reported by news outlets as a factor in the downtime, should be addressed openly and honestly with constituents early on. The credibility of 24/7 cloud support is at stake.

Utility Service Provider Design

Obviously the design of the external electric grid plays a role in any outage, no matter the cause. Early planning stages in design and control are key factors in redundancy, efficiency and reliability.

Did thorough collaboration exist between the utility and the customer in the facility design process?

  • What redundancy features did the utility provide in the design phase of the cloud site?
  • What site factors led Amazon to believe this area's utility was capable of handling unforeseen outages through prevention techniques?
  • How much utility infrastructure is above ground rather than underground, and therefore susceptible to damage by weather or other contingency factors?
  • Are backup substations available to redirect power if the local grid goes down?
  • What is the utility's track record on outages, infrastructure repairs, and downtime?
  • Where does the cloud site stand in the hierarchy for restoring service: high, medium, or low?

Off-Site Redundancy-Backup Facility

Inherently, these types of utility outages will occur because the national grid infrastructure is aging and vulnerable to costly disruptions. The above-ground utility pole grid is notoriously aged and lacks the design upgrades needed to protect critical areas from massive outages. This is a known fact that businesses must work around by building on-site and off-site redundancy.

Amazon's cloud services for the Eastern U.S. should have been automatically switched to a redundant system, such as its West Coast operation. Why this did not happen is a mystery, but Amazon should own up to its design miscalculations and move to inform customers of its future plans for eliminating downtime.
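As a rough illustration of what "automatically switched" could mean in practice, here is a minimal sketch of priority-ordered failover. It is not a description of how Amazon's systems actually work. The function name `pick_region`, the region priority list, and the health check are all hypothetical, and a real health check would probe a live endpoint rather than consult a hard-coded set.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in priority order, whose health check passes."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated the same as a failed one.
            continue
    return None  # every region failed its check


# Hypothetical priority list: the East Coast site first, West Coast fallbacks after.
REGIONS = ["us-east-1", "us-west-1", "us-west-2"]

# Simulated outage: the primary site is down (assumption, for illustration only).
down = {"us-east-1"}
active = pick_region(REGIONS, lambda region: region not in down)
print(active)  # us-west-1
```

The point of the sketch is only that the fallback decision is made by software the moment a check fails, rather than by people after the outage has begun.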

Image Repair - Epitaph

Move candidly and quickly to report the steps being taken to correct the outage issues. Hire a PR firm to manage a media campaign designed to restore credibility. Hopefully this is not an epitaph for Amazon Cloud Services. The problem remains that many companies, not just Amazon, are risking their business operations on poorly designed redundancy. Since proper design and maintenance are not revenue-driven expenditures, sadly they do not get the attention they need. This outage is a striking example of how an ambivalent policy can get organizations into embarrassing situations which, predictably, are heavily covered by media outlets.

This evaluation of what happened, and of its causes, is a reflection on the Amazon Cloud Services site in Northern Virginia and does not reflect actual events at the time. It is an educated guess as to what could have happened, based on public knowledge.

By Leonard Grace, Founder & Editor - Broadband Convergent

Frank Bulk  –  Jul 05, 2012 8:09 PM PDT

A more detailed post-mortem can be found here.  There's a discussion of their generator testing practices.

While it's unfortunate that they had power issues, there's a balance between cost and making sure power does not go out.

Leonard Grace  –  Jul 06, 2012 9:25 AM PDT

Hello Frank,

Thanks for the update to this post, including the AWS review of and comments on the outage. It remains clear that AWS customers need to evaluate their specific needs for tolerable downtime and work with AWS on a price/performance solution.

Phil Howard  –  Jul 08, 2012 9:11 AM PDT

Amazon did the right thing to ensure general availability, and that is by having redundancy.  There are plenty of scenarios where a data center can be knocked out.  What if lightning struck, and destroyed, the generator(s) and/or transfer switch(es)?  They have redundancy called "us-west-1" and "us-west-2" and "eu-west-1", etc.

What we have is a case of customers not fully utilizing Amazon's redundancy infrastructure in their application design and deployment.  I think Amazon should spend their investment in facility improvements in other directions, like setting up more data centers in other locations.  I'd like to see them get "us-east-2" and "us-midw-1" up and running next.  But they also need more redundancy in EU, as well ... "eu-east-1" and "eu-nord-1" should be next.

I do not doubt they will learn exactly what caused this generator and power system failure, and study what can be done to prevent that.  If it is economically feasible to prevent it, then they should.

We don't try to make network connections 100% reliable against loss or corruption of data.  We engineer around it so that our use of network connections is not impacted.  Checksums, error correction codes, and retransmission go a long way toward accomplishing this very economically.

Customers of Amazon Web Services need to learn how to fit their own applications to the available redundancy infrastructure they buy into.  AWS is not just a place to offload peak demands ... it's the redundancy you'd like to build out for yourself but cannot afford to do.  Even non-cloud-ish hosting services are already providing multiple sites.  You have to know how to make your application deal with it.
