When Microsoft fell off the grid, its first reaction was to cover its butt.
Jan 26, 2001 | Anyone who works with or on today's computer networks knows how fragile and crotchety they can be. They are rarely more than a step or two away from disaster, and so, as Ellen Ullman once put it, "The real-world experience of system managers is a kind of permanent state of emergency."
The sirens were screaming in Redmond this past week, as it was Microsoft's turn to experience the sheer awfulness of network collapse. The company had to admit that it had fallen off the Internet for the better part of two days because, er, well, somebody screwed up. Caught up in the downtime were not only Microsoft's own Web site but its MSN.com Internet service, MSNBC, Hotmail and Expedia and even Slate.
What had happened that could bring this legendarily proud "Business @ the Speed of Thought" to its network knees? Though there was much idle chatter about hacker attacks in the wake of the initial outage, it gradually emerged that Microsoft had what those in the business call "a DNS problem." A big DNS problem.
DNS is the Domain Name System, which translates the familiar verbal monikers by which we address computers on the Net, like www.salon.com, into the Internet's fundamental numeric (IP) addresses -- opaque strings of digits like 208.178.101.40. Without the DNS, there'd be no "dot com" or "dot" anything.
The DNS is the reason that we can remember Web addresses, and even have verbal fun with them -- like the pranksters who spammed the Internet registry with graffiti-like anti-Microsoft slogans by simply registering them as domain names. Compare it to the telephone system -- which requires that we attempt to remember (or recall from tattered phonebooks or balky software programs) long strings of meaningless digits in order to contact anyone -- and you realize that the Internet could be a lot harder to use than it is.
But in order for the DNS to work, we depend on a distributed network of computers that ask and tell one another where to find the various names, and update one another when addresses change. These domain-name server computers keep the Internet running -- but they can also form what engineers call a "single point of failure," a weak spot in the chain between you and a Web site you want to visit. Large Internet operations typically safeguard themselves by running multiple domain-name servers -- so that if one goes down the others keep responding to the "where are you?" questions pouring in from the Net. This is called redundancy: a bad thing in writing, a prized thing in networking.
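For the code-minded, the idea of redundancy can be sketched in a few lines of Python. This is a toy model, not how any real resolver is written: the make_server and resolve functions are hypothetical names invented for illustration, and each "name server" is just a function that either answers or raises an error when it is unreachable.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a name server: a function that maps a name to
# an IP address, or raises ConnectionError when the server is unreachable.
NameServer = Callable[[str], str]


def make_server(records: Dict[str, str], up: bool = True) -> NameServer:
    """Build a toy name server that answers from `records` while it is up."""
    def answer(name: str) -> str:
        if not up:
            raise ConnectionError("name server unreachable")
        return records[name]
    return answer


def resolve(name: str, servers: List[NameServer]) -> Optional[str]:
    """Ask each server in turn; return the first answer, or None if all fail."""
    for server in servers:
        try:
            return server(name)
        except ConnectionError:
            continue  # this server is down, so try the next one
    return None


# The one record used here is the example pairing from the article.
RECORDS = {"www.salon.com": "208.178.101.40"}
```

With one dead server and one live one, resolve still returns an address. With every server dead, it returns None, which is the toy equivalent of falling off the Net.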
Until its problems hit, Microsoft was running four domain-name servers -- but apparently all four were located together (physically and in terms of their network addresses), so that when the hapless Microsoft employee who misconfigured a router last Tuesday knocked them off the network, the corporation's entire Internet presence gradually dimmed and died. This, then, was a double human error: A poor engineering choice in designing the system's architecture in the first place made it highly vulnerable to an operational error down the line.
The latter can happen to anyone, but the former should be something a technological colossus like Microsoft can easily avoid, networking experts say -- and they flayed Microsoft mercilessly for the goof on mailing lists and in press reports.
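The architectural mistake is easy to check for, at least in rough form. The Python sketch below groups a list of name-server addresses by network and reports whether they all sit in one. The function names are hypothetical, the /24 network size is an assumption chosen for illustration, and the addresses in the usage note are reserved documentation addresses, not Microsoft's real ones.

```python
import ipaddress
from typing import List, Set


def distinct_networks(addresses: List[str], prefix_len: int = 24) -> Set[ipaddress.IPv4Network]:
    """Return the set of networks (at the assumed prefix length) the addresses fall in."""
    return {
        ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        for addr in addresses
    }


def shares_single_network(addresses: List[str], prefix_len: int = 24) -> bool:
    """True when every address lands in the same network: one router away from darkness."""
    return len(distinct_networks(addresses, prefix_len)) == 1
```

Four servers at 192.0.2.10 through 192.0.2.13 would be flagged as sharing a single network, while a pair at 192.0.2.10 and 198.51.100.10 would not. A check like this only looks at addresses; it says nothing about whether the machines also share a room, a power supply or a router.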