
Live Long and (Do Not) Prosper: Lessons and Reminders from Yesterday's Wikipedia Outage

Tom Daly

Yesterday's Wikipedia outage, which resulted from invalid DNS zone information, provides some good reminders about the best and worst attributes of active DNS management. The best part of the DNS is that it gives knowledgeable operators a great tool for steering traffic around trouble spots on a network. In this case, Wikipedia was attempting to route around its European data center because of an overheating problem that caused its servers at that location to shut down. This is a classic example of how important DNS is to disaster recovery planning. Effectively implemented, the failover away from the European site would have been transparent to users and the 'disaster' would have been averted, showcasing the resiliency of the Internet at its best.

Unfortunately, it is also a classic example of how devastating even 'small' errors in implementing such a DNS-based failover strategy can be to site uptime. Small errors can grow into a big problem because the DNS, while flexible, can also be incredibly unforgiving. The root (pun intended) of the issue is that the DNS works by storing information in relatively few authoritative name servers, from which recursive name servers pull and cache information in response to users' requests. Depending on the Time-To-Live (TTL) set on the records in the zone at issue, a recursive server could keep the information anywhere from seconds to hours before going back to the authoritative server for an update. Once invalid zone data is pulled into a recursive DNS server's cache, the server won't check back for new information until the TTL expires, which means that bad information can linger long after the zone is fixed, sending user after user (after user) to the wrong place or to no place at all. This architecture is why, at least with respect to DNS zone information, Spock was wrong: long life is not prosperous.
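To make the caching behavior concrete, here is a minimal sketch of a recursive resolver's cache. It is a toy model and not how any real resolver is implemented: the `RecursiveCache` class, the host name, and the addresses (taken from the documentation ranges) are all made up for illustration, and time is passed in by hand rather than read from a clock.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class CachedAnswer:
    address: str
    expires_at: float  # time (seconds) at which the cached answer goes stale


class RecursiveCache:
    """Toy model of a recursive resolver cache (hypothetical, for illustration)."""

    def __init__(self, authoritative: Dict[str, Tuple[str, int]]):
        # authoritative maps a name to (address, ttl_seconds)
        self.authoritative = authoritative
        self.cache: Dict[str, CachedAnswer] = {}

    def resolve(self, name: str, now: float) -> str:
        cached = self.cache.get(name)
        if cached is not None and now < cached.expires_at:
            # Still within the TTL: the authoritative server is not consulted.
            return cached.address
        # Cache miss or expired entry: fetch from the authoritative data.
        address, ttl = self.authoritative[name]
        self.cache[name] = CachedAnswer(address, now + ttl)
        return address


if __name__ == "__main__":
    name = "www.example.org"
    zone = {name: ("192.0.2.66", 3600)}        # bad record published with a 1 hour TTL
    resolver = RecursiveCache(zone)

    print(resolver.resolve(name, now=0))       # bad answer is fetched and cached
    zone[name] = ("198.51.100.10", 3600)       # operator fixes the zone at t=60
    print(resolver.resolve(name, now=60))      # cache still returns the bad answer
    print(resolver.resolve(name, now=3600))    # TTL expired: the fix is finally seen
```

In this sketch the zone is repaired after one minute, yet the resolver keeps handing out the bad address for the rest of the hour, which is exactly the lingering effect described above.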

The good news is that you can get the best from DNS while avoiding the problems caused by its architecture. The key is having a DNS solution that mitigates both of the risks that caused the Wikipedia outage: the failure to notice (or prevent) the introduction of invalid zone data and the failure to use low latency DNS resolution (or short TTLs). What constitutes a best practice in this area is up for some debate, but our perspective is that, with respect to zone data, using a utility such as 'named-checkzone' for BIND or 'tinydns-data' for DJBDNS is a must, as it gives administrators an opportunity to check zone data for errors and correct them before they go live. With respect to TTLs, best practice suggests that an optimal TTL is half of the site's desired 'mean time to repair' (MTTR), which is generally determined by the site's larger disaster recovery strategy.
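As a rough sketch of how both practices might be wired into a deployment script, the snippet below computes a TTL as half of a target MTTR and runs 'named-checkzone' on a zone file before it is published. The function names, the zone name, and the file path are hypothetical. The sketch assumes BIND's 'named-checkzone' is on the PATH and relies only on its basic `named-checkzone zonename filename` form and its non-zero exit status on errors.

```python
import subprocess
from typing import List, Tuple


def recommended_ttl(mttr_seconds: int) -> int:
    """Half of the desired mean time to repair, but never below one second."""
    return max(1, mttr_seconds // 2)


def build_check_command(zone_name: str, zone_file: str) -> List[str]:
    """Argument list for BIND's named-checkzone: zone name, then zone file."""
    return ["named-checkzone", zone_name, zone_file]


def zone_is_valid(zone_name: str, zone_file: str) -> Tuple[bool, str]:
    """Run named-checkzone and report (ok, combined tool output)."""
    try:
        result = subprocess.run(
            build_check_command(zone_name, zone_file),
            capture_output=True,
            text=True,
            check=False,  # a bad zone is an expected outcome, not an exception
        )
    except OSError as exc:
        # Typically means named-checkzone is not installed or not on the PATH.
        return False, f"could not run named-checkzone: {exc}"
    return result.returncode == 0, (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    # Hypothetical target: be fully recovered within 10 minutes of a failure.
    print("TTL to publish:", recommended_ttl(600), "seconds")

    ok, output = zone_is_valid("example.org", "db.example.org")
    if not ok:
        raise SystemExit(f"Refusing to publish zone:\n{output}")
    print("Zone passed validation; safe to reload.")
```

The point of the design is that the check sits in front of the reload step, so an invalid zone stops the deployment instead of reaching the authoritative servers and, from there, recursive caches.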

By Tom Daly, Chief Technology Officer at Dynamic Network Services, Inc. Visit the blog maintained by Tom Daly here.

Related topics: Cloud Computing, Data Center, DNS, Domain Names, Security


Comments

DNS Contingency Switcher Paul Roberts  –  Apr 12, 2010 3:30 AM PDT

We have produced a tool that will automate the switching of DNS records for disaster recovery or maintenance purposes. It integrates with Infoblox (using bloxTools) and VitalQIP.

Please see here for more info:

tuscany networks DNS Contingency Switcher

Cheers,

Paul

