
Outages Never Sleep!

No matter how much robustness and redundancy you build into your multi-tiered infrastructure, you are bound to suffer outages. I don't mean the failure of a single server, but a complex outage, usually one external to the operation of the infrastructure. What matters is how you communicate when things do go awry. The words I'm searching for are transparency and openness.

We've seen over and over again that notification and continual updates get overlooked or ignored during an outage. I do understand that the first and foremost goal for any organization during an outage is to stabilize its network infrastructure, but what's even more frustrating is the lack of a communication channel (status blog, public health dashboard, etc.), which leaves customers without a clue, let alone able to get through to the NOC for a straight answer.

What matters most to many, including myself, is that outage analysis is communicated in a timely manner during downtime events, and that the outcome leads to lessons learned (a post-mortem), as it serves as a great example of how not to do things.

The intention here is not to criticize companies, but quite the opposite; those who have chosen to publicize the causes of outages should be applauded by customers for being open. They have shared their efforts to learn and improve the availability, robustness, scalability, and performance of network services. After all, every ISP encounters the same challenges.

Fred Brooks nicely articulated:

You can learn more from failure than success. In failure you're forced to find out what part did not work. But in success you can believe everything you did was great, when in fact some parts may not have worked at all. Failure forces you to face reality.

However, the "real" costs include some losses that are harder to quantify but may be far greater. For example,

  • Lost revenue from dissatisfied customers moving to competitors or taking new business to competitors
  • The cost of a tarnished image; the lessened ability to credibly market future "premium" differentiated services and position them against competitors.

There seem to be three simple rules of outages:

  1. Things break; design for that inevitability.
  2. Everyone recognizes rule #1.
  3. Keep your customers and community informed. It matters a lot to everyone, and the benefits of being open and honest easily outweigh the stress of taking a new approach to communications.

Keep in mind that a good deal of what I've outlined here will seem a lot like common sense, which is a good thing. Quite often the simplest approaches to problem solving are the best ones, and openness and transparency are no exception.

There, got that off my chest.

Full Disclosure: I am a moderator for wiki.outages.org, and I'm disclosing this to provide transparency about what I do and why I do it.

By Virendra Rode, Network Consultant

Comments

Technical and Emotional Solutions By Alex Tajirian  –  Sep 21, 2010 10:11 am PDT

Although I read it between the lines, you should make risk management more explicit. Service providers need to have in place monitoring and operating procedures to handle such events (including unknown unknowns), which require, as you note, technical and customers’ emotions management solutions. Unfortunately, however, service providers, in general, ignore the emotional side.

With risk management, the losses that you note would be minimized and, when “properly” implemented, can result in sticky customers. Moreover, you can use risk management successes as emotional springboard stories.

Hi Alex, The 64 bit question is, how By Virendra Rode  –  Sep 21, 2010 1:47 pm PDT

Hi Alex,

The 64 bit question is, how can we engage and/or encourage providers to be more forthcoming and report outages without being concerned about bottom line and instead putting their customer's interest first? I will even go out on a limb and say this: it's a matter of time before the heavy-handedness of government, aka "regulation", forces companies into a corner if closed-door outage reporting continues, and this will further diminish the "free market".

Now, I am confused. The original post By Alex Tajirian  –  Sep 21, 2010 4:20 pm PDT

Now, I am confused.

The original post suggests that you are an advocate of technical and emotional solutions. If true, emotional solutions should improve their bottom line. But in the comment you say, “being concerned about bottom line and instead putting their customer's interest first.” I am pointing out that emotional solutions improve the bottom line because they put customers’ interest first.

Many solutions providers (whether individuals, corporations, or governments) seem not to understand the value to the bottom line of risk management and the need to integrate technical/operational with emotional solutions.

I think we are saying the same By Virendra Rode  –  Sep 21, 2010 9:24 pm PDT

I think we are saying the same thing, and maybe I wasn't clear. We want the customer's interest first. IMO, openness and transparency are key to building trust with customers.

Providers are reluctant to agree with you and maybe publicly report their service as “bad”, especially if not everyone has to report on the same basis and/or the measurement is not universally recognized. Even with the existence of a protective agreement, no one wants to report; how that's defined is a separate discussion for some other day.

Rule #2 By Dan Campbell  –  Sep 22, 2010 8:31 am PDT

Actually, rule #2 is wrong. I've been in many places where many folks have unreasonable expectations, effectively believing things don't break: senior management and sales first and foremost, both of which are typically not technical, and customers too. It's actually gotten bad, because most don't understand the cost implications of redundancy and its downstream impact on pricing that ultimately gets passed through. It's a losing battle to fight, though. It is what it is: the expectation is basically the very unrealistic "always-up" 100% availability. You just have to take your lumps from time to time when things do break, which they definitely will sooner or later.

As Ken Scafer of OpenSRS summed it By Virendra Rode  –  Sep 22, 2010 9:29 am PDT

As Ken Scafer of OpenSRS summed it up very well,

"The days of hiding are over. You now have a choice of whether you want to tell the story or have others misrepresent the story on your behalf".

What I'm advocating here is really simple: openness. Those who provide it simply show the quality of service of their organization.
