The next generation of Web search engines will do more than give you a longer list of search results. They will disrupt the information economy.
Mar 9, 2004 | When Yahoo announced its Content Acquisition Program on March 2, press coverage zeroed in on its controversial paid inclusion program, whereby customers can pony up in exchange for enhanced search coverage and a vaunted "trusted feed" status. But lost amid the inevitable search-wars storyline was another, more intriguing development: the unlocking of the deep Web.
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.
Today, the deep Web remains invisible except when we engage in a focused transaction: searching a catalog, booking a flight, looking for a job. That's about to change. In addition to Yahoo, outfits like Google and IBM, along with a raft of startups, are developing new approaches for trawling the deep Web. And while their solutions differ, they are all pursuing the same goal: to expand the reach of search engines into our cultural, economic and civic lives.
As new search spiders penetrate the thickets of corporate databases, government documents and scholarly research databanks, they will not only help users retrieve better search results but also siphon transactions away from the organizations that traditionally mediate access to that data. As organizations commingle more of their data with the deep Web search engines, they are entering into a complex bargain, one they may not fully understand.
Case in point: In 1999, the CIA issued a revised edition of "The Chemical and Biological Warfare Threat." It's a public document, but you won't find it on Google. To find a copy, you need to know your way around the U.S. Government Printing Office catalog database.
The world's largest publisher, the U.S. federal government generates millions of documents every year: laws, economic forecasts, crop reports, press releases and milk pricing regulations. The government does maintain an ostensible government-wide search portal at FirstGov -- but it performs no better than Google at locating the Hatfill report. Individual government branches maintain thousands of other publicly accessible search engines, from the Library of Congress catalog to the U.S. Federal Fish Finder.
"The U.S. Government Printing Office has the mandate of making the documents of the democracy available to everyone for free," says Tim Bray, CTO of Antarctica Systems. "But the poor guys have no control over the upstream data flow that lands in their laps." The result: a sprawling pastiche of databases, unevenly tagged, independently owned and operated, with none of it searchable in a single authoritative place.
If deep Web search engines can penetrate the sprawling mass of government output, they will give the electorate a powerful lens into the public record. And in a world where we can Google our Match.com dates, why shouldn't we expect that kind of visibility into our government?
When former Treasury Secretary Paul O'Neill gave reporter Ron Suskind 19,000 unclassified government files as background for the recently published "The Price of Loyalty," Suskind decided to conduct "an experiment in transparency," scanning in some of the documents and posting them to his Web site. If it weren't for the work of Suskind (or at least his intern), Yahoo Search would never find Alan Greenspan's scathing 2002 comments about corporate-governance reform.
The CIA and Dick Cheney notwithstanding, there is no secret government conspiracy to hide public documents from view; it's largely a matter of bureaucratic inertia. Federal information technology organizations may not solve that problem anytime soon. The deep Web search engines may just solve it for them.