Live Long and (Do Not) Prosper: Lessons and Reminders from Yesterday’s Wikipedia Outage

	By Tom Daly Chief Scientist and Co-Founder at Dyn Inc
	March 25, 2010

Yesterday’s Wikipedia outage, which resulted from invalid DNS zone information, is a good reminder of the best and worst attributes of active DNS management. The best part of the DNS is that it gives knowledgeable operators a great tool for steering traffic around trouble spots on a network. In this case, Wikipedia was attempting to route around its European data center because of an overheating problem that caused Wikipedia’s servers at that location to shut down. This is a classic example of how important DNS is to disaster recovery planning. Effectively implemented, the failover away from the European site would have been transparent to users and the ‘disaster’ would have been averted, showcasing the resiliency of the Internet at its best.

Unfortunately, it is also a classic example of how devastating even ‘small’ errors in implementing such a DNS-based failover strategy can be to site uptime. Small errors can grow into big problems because the DNS, while flexible, can also be incredibly unforgiving. The root (pun intended) of the issue is that the DNS works by storing information in relatively few authoritative name servers, from which recursive name servers pull and cache information in response to a user’s request for the zone information. Thus, depending on the Time-To-Live (TTL) set on the records in the zone at issue, the recursive server could keep the information for anywhere from seconds to hours before going back to the authoritative server for an update. Once invalid zone data is pulled into a recursive DNS server, that server won’t check back for new information until the TTL expires, which means that bad information can linger long after the zone is fixed, sending user after user (after user) to the wrong place or to no place at all. This architecture is why, at least with respect to DNS zone information, Spock was wrong: long life is not prosperous.
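To make the caching behavior concrete, here is a minimal Python sketch of a toy recursive resolver. It is an illustration only, not how any real resolver or Wikipedia's setup is implemented: the class and function names, the host name, the addresses (taken from documentation ranges), and the timings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BAD_ADDRESS = "192.0.2.1"      # hypothetical address from the invalid zone push
GOOD_ADDRESS = "198.51.100.7"  # hypothetical address after the zone is fixed
HOST = "www.example.org"       # hypothetical host name


@dataclass
class Record:
    address: str
    ttl: int  # seconds a resolver may cache this answer


class RecursiveCache:
    """Toy recursive resolver: serves cached answers until their TTL expires."""

    def __init__(self, authoritative: Dict[str, Record]) -> None:
        # The dict stands in for the authoritative name server.
        self.authoritative = authoritative
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry)

    def resolve(self, name: str, now: float) -> str:
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            # Still fresh: the authoritative server is not consulted at all.
            return cached[0]
        record = self.authoritative[name]
        self._cache[name] = (record.address, now + record.ttl)
        return record.address


def simulate_bad_push(ttl: int) -> List[str]:
    """Query at t=0, t=120 and t=3600 seconds; the zone is fixed after the first query."""
    zone = {HOST: Record(BAD_ADDRESS, ttl)}
    resolver = RecursiveCache(zone)
    answers = [resolver.resolve(HOST, now=0)]  # bad data gets cached here
    zone[HOST] = Record(GOOD_ADDRESS, ttl)     # operator fixes the zone
    answers.append(resolver.resolve(HOST, now=120))
    answers.append(resolver.resolve(HOST, now=3600))
    return answers
```

In this sketch, `simulate_bad_push(3600)` still returns the bad address at the two-minute mark even though the zone has already been fixed, while `simulate_bad_push(60)` has recovered by then.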

The good news is that you can get the best from DNS while avoiding the problems caused by its architecture. The key is having a DNS solution that mitigates both of the risks that caused the Wikipedia outage: the failure to notice (or prevent) the introduction of invalid zone data and the failure to use low latency DNS resolution (or short TTLs). What constitutes a best practice in this area is up for some debate, but our perspective is that, with respect to zone data, using a utility such as ‘named-checkzone’ for BIND or ‘tinydns-data’ for DJBDNS is a must, as it gives administrators an opportunity to check zone data for errors and correct them before they go live. With respect to TTLs, best practice suggests that the optimal TTL is half of the site’s desired ‘mean time to repair’ (MTTR), which is generally determined by the site’s larger disaster recovery strategy.
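Both recommendations can be sketched in a few lines of Python. This is a hedged illustration rather than a production deployment script: the helper names `recommended_ttl` and `zone_passes_check` are hypothetical, and the second helper assumes BIND's `named-checkzone` is installed and on the PATH (it is invoked as `named-checkzone zonename filename` and exits with status 0 when the zone loads cleanly).

```python
import subprocess


def recommended_ttl(mttr_seconds: int) -> int:
    """Apply the rule of thumb above: TTL is half the desired MTTR (at least 1 second)."""
    if mttr_seconds <= 0:
        raise ValueError("MTTR must be a positive number of seconds")
    return max(1, mttr_seconds // 2)


def zone_passes_check(zone_name: str, zone_file: str) -> bool:
    """Validate a zone file with named-checkzone before it is published.

    Raises FileNotFoundError if named-checkzone is not installed.
    """
    result = subprocess.run(
        ["named-checkzone", zone_name, zone_file],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Show whatever diagnostics the tool produced so the operator can fix the zone.
        print(result.stdout or result.stderr)
    return result.returncode == 0
```

For example, a site that wants to recover within 10 minutes would call `recommended_ttl(600)` and get a TTL of 300 seconds. A deployment step could then refuse to push a zone unless `zone_passes_check("example.org", "example.org.zone")` returns `True`; both arguments here are placeholder names.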

Dyn is the Internet Infrastructure-as-a-Service (IaaS) leader, featuring a full suite of DNS and Email Delivery solutions. Follow on Twitter: @TomDynInc and @dyninc.

Comments

DNS Contingency Switcher Paul Roberts – Apr 12, 2010 10:30 AM

We have produced a tool that will automate the switching of DNS records for disaster recovery or maintenance purposes. It integrates with Infoblox (using bloxTools) and VitalQIP.

Please see here for more info:

tuscany networks DNS Contingency Switcher

Cheers,

Paul
