Facebook has Facebooked about yesterday’s massive outage, explaining how the site came to be unavailable for so many hours.
The explanation, quoted below, is lengthy, and (spoiler alert) they had to turn off Facebook to fix the problem. So there you go: they can turn off Facebook:
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed… Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site.
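
That loop is easy to reproduce in miniature. Below is a rough Python sketch of the failure mode Facebook describes; every name in it (CACHE, get_config, query_database) is made up for illustration and isn’t taken from Facebook’s actual systems:

# A minimal sketch of the cache-invalidation feedback loop described above.
# All names here are hypothetical; this is not Facebook's code.

CACHE = {}          # stand-in for a memcached-style cache
DB_CAPACITY = 100   # pretend the database cluster can serve 100 queries per tick

class DatabaseOverloaded(Exception):
    """Raised when the cluster gets more queries than it can serve."""
    pass

def query_database(current_load):
    # The persistent store already holds the corrected value, but it can
    # only return it if it isn't being crushed by traffic.
    if current_load > DB_CAPACITY:
        raise DatabaseOverloaded()
    return "valid-value"

def get_config(key, current_load):
    # Each client trusts the cache; anything that looks wrong gets "fixed"
    # by deleting the cache key and asking the database for a fresh copy.
    value = CACHE.get(key)
    if value != "valid-value":
        CACHE.pop(key, None)
        try:
            value = query_database(current_load)
            CACHE[key] = value
        except DatabaseOverloaded:
            # The flaw: a database error is treated like an invalid value,
            # so the cache stays empty and the next client queries again.
            CACHE.pop(key, None)
            value = None
    return value

# Push the bad config value, then simulate waves of clients.
CACHE["some_config_key"] = "invalid-value"
for tick in range(5):
    load = 1000  # clients querying this tick, far beyond DB_CAPACITY
    failures = sum(1 for _ in range(load) if get_config("some_config_key", load) is None)
    print(f"tick {tick}: {failures}/{load} clients queried the database and failed")

Because a database error and a bad cached value are handled the same way, the cache never repopulates and the flood of queries never stops, even after the underlying configuration value has been corrected. That is why the only way out was to cut off traffic to the cluster entirely, which meant turning off the site.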
Facebook engineers say they’re working on configuration systems that handle feedback loops more “gracefully,” and that other changes to their infrastructure should help as well. They also apologized for the OMG epic Facebook meltdown, and say they take Facebook’s reliability and performance “very seriously.”