Live Long and (Do Not) Prosper: Lessons and Reminders from Yesterday's Wikipedia Outage

Tom Daly

Yesterday's Wikipedia outage, which resulted from invalid DNS zone information, provides some good reminders about the best and worst attributes of active DNS management. The best part of the DNS is that it gives knowledgeable operators a great tool for steering traffic around trouble spots on a network. In this case, Wikipedia was attempting to route around its European data center because an overheating problem had caused Wikipedia's servers at that location to shut down. This is a classic example of how important DNS is to disaster recovery planning. Effectively implemented, the failover away from the European site would have been transparent to the user and the 'disaster' would have been averted, showcasing the resiliency of the Internet at its best.

Unfortunately, it is also a classic example of how devastating even 'small' errors in implementing such a DNS-based failover strategy can be to site uptime. Small errors can grow into a big problem because the DNS, while flexible, can also be incredibly unforgiving. The root (pun intended) of the issue is that the DNS works by storing information in relatively few authoritative name servers, from which recursive name servers pull and cache information in response to a user's request for the zone information. Thus, depending on the Time-To-Live (TTL) set on the records in the zone at issue, the recursive server could keep the information anywhere from seconds to hours before going back to the authoritative server for an update. Once invalid zone data is pulled into a recursive DNS server, the server won't check back for new information until the TTL expires, which means that bad information can linger long after the zone information is fixed, sending user after user (after user) to the wrong place or to no place at all. This architecture is why, at least with respect to DNS zone information, Spock was wrong: long life is not prosperous.
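To make the caching behavior concrete, here is a minimal sketch of a toy recursive cache. It is not how any real resolver is implemented; the class names, the "BAD"/"GOOD" placeholder answers, and the timings are all assumptions chosen for illustration. It only shows that a fix published at the authoritative source stays invisible until the cached copy's TTL runs out.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    address: str       # the answer handed out to users
    expires_at: int    # simulated time (seconds) at which the TTL runs out


class RecursiveCache:
    """Toy recursive resolver: caches one record until its TTL expires."""

    def __init__(self, authoritative):
        self.authoritative = authoritative  # dict with "address" and "ttl"
        self.cached = None

    def resolve(self, now):
        # Only go back to the authoritative source once the cached copy expires.
        if self.cached is None or now >= self.cached.expires_at:
            self.cached = CachedAnswer(
                address=self.authoritative["address"],
                expires_at=now + self.authoritative["ttl"],
            )
        return self.cached.address


# Hypothetical data: a bad record is published with a one-hour TTL.
authoritative = {"address": "BAD", "ttl": 3600}
resolver = RecursiveCache(authoritative)

first = resolver.resolve(now=0)        # bad record is cached at t=0
authoritative["address"] = "GOOD"      # operator fixes the zone shortly after
authoritative["ttl"] = 300
during = resolver.resolve(now=1800)    # still inside the original TTL
after = resolver.resolve(now=3600)     # TTL expired, cache refreshes

print(first, during, after)  # BAD BAD GOOD
```

In this sketch the corrected record exists almost immediately, yet users behind this cache keep receiving the bad answer for the full hour that the original TTL allowed.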

The good news is that you can get the best from DNS while avoiding the problems caused by its architecture. The key is having a DNS solution that mitigates both of the risks that caused the Wikipedia outage: the failure to notice (or prevent) the introduction of invalid zone data and the failure to use low-latency DNS resolution (or short TTLs). What constitutes a best practice in the area is up for some debate, but our perspective is that, with respect to zone data, using a utility such as 'named-checkzone' for BIND or 'tinydns-data' for DJBDNS is a must, as it affords administrators an opportunity to check zone data for errors and correct them before they go live. With respect to TTLs, best practice suggests that an optimal TTL is half of the site's desired 'mean time to repair' (MTTR), which is generally determined by the site's larger disaster recovery strategy.
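Both recommendations can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a prescribed workflow: the function names are hypothetical, the one-hour MTTR is an invented example, and the validation helper assumes BIND's 'named-checkzone' is installed and signals a valid zone with exit status 0.

```python
import shutil
import subprocess


def recommended_ttl(mttr_seconds):
    """Return half of the desired mean time to repair, per the rule of thumb above."""
    if mttr_seconds <= 0:
        raise ValueError("MTTR must be positive")
    return max(1, mttr_seconds // 2)


def zone_is_valid(zone_name, zone_file):
    """Run BIND's named-checkzone; True only if it exits with status 0."""
    if shutil.which("named-checkzone") is None:
        raise RuntimeError("named-checkzone is not installed on this host")
    result = subprocess.run(
        ["named-checkzone", zone_name, zone_file],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Hypothetical target: the site should recover within one hour.
ttl = recommended_ttl(3600)
print(ttl)  # 1800
```

A deployment script could call zone_is_valid() with the zone name and the path to the edited zone file, and refuse to publish when it returns False, so that invalid data is caught before any recursive server has a chance to cache it.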

By Tom Daly, President and CTO at Dyn. Dyn is the Internet Infrastructure-as-a-Service (IaaS) leader that features a full suite of DNS and Email Delivery solutions. Follow on Twitter: @TomDynInc and @dyninc.

Related topics: Cloud Computing, Data Center, DNS, Domain Names, Security

Comments

DNS Contingency Switcher Paul Roberts  –  Apr 12, 2010 2:30 AM PST

We have produced a tool that will automate the switching of DNS records for disaster recovery or maintenance purposes. It integrates with Infoblox (using bloxTools) and VitalQIP.

Please see here for more info:

tuscany networks DNS Contingency Switcher

Cheers,

Paul
