"We're running into a complexity barrier in computing," says Steve White, senior manager for autonomic computing at IBM Research. "Computer scientists have done a great job of making software faster and cheaper. But we haven't paid as much attention to the people costs."
Maybe that's because, until recently, counting up the "people costs" was an inexact science itself. "Total cost of ownership" studies vary from platform to platform and often fall prey to vendor bias. Still, over the last decade, one common statistic has emerged: When it comes to running enterprise-level software, most companies spend twice as much on human talent as they do on licensing and acquisition.
While companies strive to reduce this two-thirds tax through lower labor costs (read: outsourcing), researchers are looking further down the road. One problem with hiring any human to fix or tune a system is the assumption that the system is fixable at the human level and that once fixed, it stays fixed. A quick review of recent software history, however, suggests otherwise. For at least three decades now, programmers have joked about "heisenbugs" -- software errors that surface at seemingly random intervals and whose root causes consistently evade detection. The name is a takeoff on Werner Heisenberg, the German physicist whose famous uncertainty principle posited that no amount of observation or experimentation could simultaneously pinpoint both the position and momentum of an electron.
"A lot of the bugs we're seeing in modern systems have been plaguing programmers from the beginning of time," says Fox, the head of Stanford's Software Infrastructures Group. "The only difference now is machines just crash faster."
One remedy for this situation is a strategy so simple every user has relied on it at least once or twice: Reboot the machine and start from scratch. Fox and Stanford University doctoral student George Candea have collaborated on a series of papers investigating a tactic originally known as partial rebooting, which Candea now calls "micro-rebooting." Instead of digging through the source code to fix errors, their strategy calls upon system managers to simply reboot the offending components while leaving the rest of the network operationally intact.
"In a lot of cases, rebooting cures the problem much faster than fixing the root cause," Candea says. "We see this all the time with PCs. Rebooting takes 30 seconds to a minute, enough time for a bathroom break. When you come back, the problem is usually gone and you can go back to work."
Rebooting the components of a computer network is, of course, more challenging than rebooting an individual PC. Network administrators have to guard against lost data and whatever performance loss such outages might incur. Still, thanks to clustering, a strategy that bundles low-cost hardware resources in a way that makes it easy for one machine to pick up another machine's workload in the event of a failure or shutdown, most e-commerce networks already have that built-in safeguard. Fox and Candea have worked together to develop a process they call recursive restartability, in which an automated network manager systematically goes through a network's node tree, rebooting each branch as a form of preventive maintenance.
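For readers who think in code, the tree walk described above can be sketched roughly as follows. This is a toy illustration in Python, not Fox and Candea's software: the `Node` class, the `rolling_restart` function, the example node names, and the choice to restart children before their parent are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for one machine or software component."""
    name: str
    children: List["Node"] = field(default_factory=list)
    restarts: int = 0

    def restart(self) -> None:
        # A real manager would stop and relaunch the component here;
        # this sketch only counts the restart.
        self.restarts += 1


def rolling_restart(node: Node, log: List[str]) -> None:
    """Walk the tree depth-first, restarting one node at a time.

    Children are restarted before their parent (an assumed ordering),
    so only one branch is offline at any moment.
    """
    for child in node.children:
        rolling_restart(child, log)
    node.restart()
    log.append(node.name)


# Example tree: a cluster with a two-machine web tier and a database.
tree = Node("cluster", [
    Node("web", [Node("web-1"), Node("web-2")]),
    Node("db"),
])
order: List[str] = []
rolling_restart(tree, order)
print(order)
```

Because the walk touches one node at a time, the rest of the cluster is free to absorb that node's workload, which is the safeguard clustering is said to provide.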
Lately, however, Candea has been looking at an even more sophisticated approach, one that gives a system its own ability to target and correct failing components. He calls it crash-only computing, and the strategy is to marry micro-rebooting with the increasingly popular diagnostic tactic known as fault injection. Candea has built a Java application server divided into two main components: management and monitoring. The monitoring side periodically sends queries into the software system and watches for any sign of bad data.
If the messages trigger an erroneous response, the monitors' own components compare notes on the error path, generate a statistical estimate of the faulty component, and send a signal to the management component to perform a micro-reboot. According to a paper released last year, Candea's self-monitoring Java server was able to increase system dependability by 78 percent while reducing service outages from 12 per hour to zero.
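The monitor-then-reboot loop described above might look something like the following Python sketch. It is not Candea's server (which is written in Java), and the scoring rule is an assumption: each component is ranked by the share of probe requests passing through it that failed. The names `suspect_component`, `Manager`, and `micro_reboot`, along with the sample probe data, are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Each probe records which components a request touched and whether it succeeded.
Probe = Tuple[List[str], bool]


def suspect_component(probes: List[Probe]) -> Optional[str]:
    """Return the component most strongly associated with failed probes.

    Assumed scoring rule: failed requests through a component divided by
    all requests through it, with ties broken by raw failure count.
    """
    seen: Counter = Counter()
    failed: Counter = Counter()
    for components, ok in probes:
        for name in set(components):
            seen[name] += 1
            if not ok:
                failed[name] += 1
    if not failed:
        return None  # no bad responses, nothing to reboot
    return max(failed, key=lambda name: (failed[name] / seen[name], failed[name]))


class Manager:
    """Hypothetical management side: restarts only the named component."""

    def __init__(self) -> None:
        self.rebooted: List[str] = []

    def micro_reboot(self, component: str) -> None:
        # A real server would tear down and reinitialize the component here.
        self.rebooted.append(component)


# Made-up probe results for illustration only.
probes: List[Probe] = [
    (["frontend", "cart", "db"], False),
    (["frontend", "catalog", "db"], True),
    (["frontend", "cart"], False),
    (["frontend", "search"], True),
]

manager = Manager()
culprit = suspect_component(probes)
if culprit is not None:
    manager.micro_reboot(culprit)
print(culprit, manager.rebooted)
```

In this made-up data every request that touched the cart component failed, while the front end and database also served successful requests, so the cart is the one singled out for a micro-reboot.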
It's at this point that a technology journalist must fight the urge to invoke biological metaphors, an urge all the more compelling because many programmers, IBM's White included, consider experiments like Candea's a first step toward autonomic computing systems that manage internal resources the same way the human body's own autonomic nervous system regulates heart rate and breathing.
"First of all, I'm a real fan of ROC," says White, referring to recovery-oriented computing. "It's that notion of self that I think is the key idea of autonomic computing and the most revolutionary part."
Candea, for one, is hesitant to invoke biological metaphors but notes that, for discussing overly complex systems, sometimes they are the only parallels available. Like the body's own autonomic system, which operates independently of the conscious brain, his Java server works best when the monitoring component is strictly isolated from the management component. The same goes for all components. Without rigid functional boundaries, the software equivalent of cell membranes, it is almost impossible to tell which component is in need of a restart.
"It's all about having isolation of what we in computer-speak call the fault domain," says Candea.