For almost as long as there has been a Web, there have been Web search engines. So one might reasonably ask why the deep Web has remained out of view for so long.
Traditionally, Web search engines have grown their databases through simple brute force. All the major search engines survey the Web by dispatching legions of simple programs known as spiders, crawlers, robots or harvesters to trace their way through the endless chains of hyperlinks that tie Web pages together.
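For readers who want to see the mechanics, here is a minimal sketch of link-following in Python. It is an illustration rather than any search engine's actual crawler: the `PAGES` dictionary is a hypothetical stand-in for the Web, and a real spider would fetch pages over HTTP and respect each site's crawling rules.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "Web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "/home": '<a href="/about">About</a> <a href="/catalog">Catalog</a>',
    "/about": '<a href="/home">Home</a>',
    "/catalog": '<form action="/search"><input name="q"></form>',
    "/search?q=widgets": "<p>Results that only a form submission would reveal</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Breadth-first traversal of hyperlinks, returning pages in visit order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Calling `crawl("/home")` visits the three pages joined by ordinary links. The search results page is never reached, because no hyperlink points to it; it exists only behind the form.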
That method works well for the static HTML pages and predictable URLs that make up the upper strata of the Web. But the deep Web resides mostly in databases, shielded by a lattice of registration gateways, session cookies and dynamically generated links. Unless an organization consciously chooses to share its data by opening up an API or Web services feed -- the way Amazon books show up in a Google search -- the data will likely remain unseen by most users.
New search engines now under development are exploring methods for penetrating the database barriers. BrightPlanet has developed a formula for brokering queries across multiple deep Web data sources at once, aggregating the results and letting users compare changes to those results over time -- a process known as "differencing."
That capability has attracted considerable interest from certain government agencies that shall remain nameless. "Some of our clients are spooky," says BrightPlanet COO Duncan Wittes. Other BrightPlanet customers include state governments, competitive intelligence researchers, and political campaigns whose "oppo" teams may want to search not only for what a candidate has said but also for what he or she may have "unsaid" over time.
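The two ideas described above, brokering one query across several sources and "differencing" the results over time, can be sketched roughly as follows. This is not BrightPlanet's formula, which the company has not published here; the source names, the sample records and both function names are invented for illustration.

```python
def federated_search(query, sources):
    """Send one query to every source and merge the results into a single set."""
    results = set()
    for name, search in sources.items():
        for record in search(query):
            results.add((name, record))
    return results


def difference(earlier, later):
    """Report what appeared and what disappeared between two snapshots."""
    return {"added": later - earlier, "removed": earlier - later}


# Hypothetical sources: each maps a query string to the records that match it.
monday = {
    "press_office": lambda q: [s for s in ["tax plan speech", "tax cut pledge"] if q in s],
    "campaign_site": lambda q: [s for s in ["tax FAQ"] if q in s],
}
friday = {
    "press_office": lambda q: [s for s in ["tax plan speech"] if q in s],
    "campaign_site": lambda q: [s for s in ["tax FAQ", "tax reform op-ed"] if q in s],
}

changes = difference(federated_search("tax", monday), federated_search("tax", friday))
```

In this toy data, `changes["removed"]` holds the pledge that vanished from the press office between Monday and Friday, which is the kind of "unsaid" statement an opposition researcher would be looking for.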
Soon-to-launch Dipsie is pursuing an alternative approach to unlocking the dynamic Web, by deploying a kind of souped-up spider that penetrates barriers like forms, drop-down lists, dynamically generated URLs and session cookies. Dipsie's spider works by emulating a "well-formed user" that, from the Web site's point of view, behaves just like a real flesh-and-mouse user, enabling the spider to cache the kind of data typically visible only to a human user.
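Dipsie has not disclosed how its spider works, so the following is only a guess at the simplest piece of the problem: reading a form's drop-down lists and generating every URL a human could produce by choosing from them. The `FORM` markup, the `FormParser` class and the `form_urls` function are all assumptions made for this sketch, and it ignores session cookies, free-text fields and POST submissions.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode

# Hypothetical search form with two drop-down lists.
FORM = (
    '<form action="/search">'
    '<select name="state"><option value="CA">California</option>'
    '<option value="NY">New York</option></select>'
    '<select name="year"><option value="2003">2003</option>'
    '<option value="2004">2004</option></select>'
    "</form>"
)


class FormParser(HTMLParser):
    """Records a form's action URL and the option values of each <select>."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.fields = {}      # field name -> list of option values
        self._current = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action", "")
        elif tag == "select":
            self._current = attrs.get("name")
            if self._current:
                self.fields[self._current] = []
        elif tag == "option" and self._current and attrs.get("value"):
            self.fields[self._current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


def form_urls(html):
    """Build one GET URL for each combination of drop-down choices."""
    parser = FormParser()
    parser.feed(html)
    names = list(parser.fields)
    combos = product(*(parser.fields[n] for n in names))
    return [parser.action + "?" + urlencode(dict(zip(names, c))) for c in combos]
```

For the sample form, `form_urls(FORM)` yields four URLs, one for each pairing of state and year. A spider could then request each of them as though a user had made the selections by hand.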
Other search developers, including IBM, Google and Intelliseek, are exploring their own approaches to mining the deep Web. But in the wake of this week's announcement, Yahoo is now the elephant in the living room.
Yahoo won't discuss the specifics of how its search algorithms work. But the company does acknowledge that its Content Aggregation Program will give paying customers a more direct pipeline into its search database. Yahoo Search vice president Tim Cadogan says, "Ultimately we want to search the whole Web for free," but he nonetheless sees CAP as a way of enabling "direct, structured relationships with content providers" to "deliver a higher-quality search experience for users."
It takes a fine ear for P.R. nuance to distinguish "higher-quality search experience" from "better results." Yahoo has issued copious disclaimers assuring non-paying customers that they will receive the same algorithmic treatment as paying ones. But the company acknowledges that paying customers will likely benefit from a "quality review" designed to help companies improve their chances of showing up in search results.
"Cadogan claims that people who send money can't count on getting better results," says Bray. "Do you believe that? I don't."