
An unusual database problem at the giant social networking site could only be cured by taking the sort of action you normally take with a misbehaving PC
Ever been on the phone to IT support, been told to turn it off and then on again, and found that it sorted the problem out?
Facebook last night had that sort of problem. So they turned the site off and on again. And it fixed their problem. Literally.
As Robert Johnson, its director of software engineering, explained in a slightly shamefaced blogpost, the site was offline for about two-and-a-half hours – its worst outage in four years – due to some technical changes that Facebook had made.
It wasn't only the site itself which went belly-up; the Like buttons (which connect back to Facebook) vanished on 350,000 sites too, and the API which powers its OpenGraph system had serious problems.
The logistics of running a vast network like Facebook mean that you don't stick all your servers in a single place, of course. Facebook runs a big caching operation, so that lots of servers replicate its content. The cache gets updated periodically; it sits on a network called tfbnw.net (for "the Facebook network"; it shows up in a traceroute to Facebook, which lists the intermediate networks between one site and Facebook), which in effect sits like a ring around the "central" Facebook site.
Sometimes, things go wrong in the cache as values go out of date; but that's no problem, usually, because you can overwrite them with correct values from the centre. At least, you would like to.
Here's how Johnson explained it:
"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
"The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the persistent store is invalid."
In other words: something went wrong inside the circle. And that wrong value got passed out to all the tfbnw.net servers that would normally serve up Facebook pages.
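For readers who think in code, here is a rough sketch of the kind of self-healing check Johnson describes. It is not Facebook's code: the function names, the "timeout" key and the rule that a valid value is a positive integer are all invented for illustration, and the assumption that a bad value in the store sends every read back to the store is ours, not something stated in the quotes above.

```python
def is_valid(value):
    # Hypothetical validity rule: a config value must be a positive integer.
    return isinstance(value, int) and value > 0


def read_config(key, cache, store, stats):
    """Return a config value, repairing the cache from the store if needed."""
    value = cache.get(key)
    if is_valid(value):
        return value
    # The cached entry is missing or invalid: ask the persistent store.
    stats["store_queries"] += 1
    fresh = store[key]
    if is_valid(fresh):
        cache[key] = fresh      # repair works; later reads stay in the cache
    else:
        cache.pop(key, None)    # assumed behaviour: drop the bad entry and retry next time
    return fresh


def simulate(store_value, reads):
    """Count how often the store is queried over a number of reads."""
    cache = {"timeout": -1}              # the cache starts with a bad value
    store = {"timeout": store_value}
    stats = {"store_queries": 0}
    for _ in range(reads):
        read_config("timeout", cache, store, stats)
    return stats["store_queries"]


print(simulate(30, 1000))   # valid store: the cache is repaired on the first read
print(simulate(-1, 1000))   # invalid store: nothing can ever be repaired
```

With a good value at the centre, the first read fixes the cache and the store is left alone. With a bad value at the centre, the check never succeeds, so in this sketch every single read lands on the store.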
Source: guardian.co.uk