Beyond the Deep Web

Modern search engines are best equipped to handle the so-called “surface Web.” However, sitting below the static content on the surface of the Web is a wealth of information that is much harder to index. This body of information has been called the “deep Web” because much of it is hidden in databases that can only be accessed via online forms. These forms are easy for humans to fill out, but they present a challenge for automated agents such as web crawlers, which need to determine what information is hiding behind the form.

But even if a web crawler could determine how to fill out a form, and could extract and index the “deep” content from a site, would such an index contain the full information potential of the Web?

Contrary to what you might think, the end goal of submitting a query to a search engine is not to find a particular web page. The goal is an answer to a question. How do I get from my house to the store? What time is a certain film playing at my local theater? While some kinds of questions are getting easier and easier to answer, most questions are far too sophisticated to ask a search engine and expect to get an accurate answer. For example, try Googling “How many Starbucks are between 2020 Broadway and 1732 W. 53 Street?” You’re not going to get the result you are looking for. Nor will you be directed to a web page where you can easily find the answer.

It seems unreasonable to ask a search engine these kinds of questions. Why?

  • For one, most search engines are keyword-oriented, and there is often no obvious way to write the question down as keywords. We’ve been trained to think of searches as sets of keyword combinations: what words can I put together to find the pages with the information I seek? Unfortunately, most real questions cannot be formulated as a set of keywords.
  • Another issue is that search engines are designed to return web pages. But since our primary need is answers, not web pages, search engines should be “answer-oriented,” not web page-oriented. So, assuming we have solved the first problem (that is, specifying the question in a manner that the search engine can understand), we want the search engine to take the necessary steps to give us the answer we are looking for.

So, we see that the whole thing boils down to two measly problems: the input is wrong and the output is wrong! Yikes!

Let’s explore what we mean by “answer-oriented” a bit. One way of thinking about an “answer-oriented” search engine is the following. Assume my question is: How many movies has Francis Ford Coppola directed? Let’s say that using its web index, the search engine is able to find some relevant pages based on keywords in the question. The obvious next step would be for the search engine to scrape those pages for the number I am looking for (perhaps using the hint “how many”) and return that number to me. This would be a helpful feature, but in reality it doesn’t do much for the searcher, who could within a few seconds do a manual grep of the page and find the number he or she was looking for. And this entire scenario is still based on our current search paradigm, namely that the results of searches are web pages.
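
To make that first scenario concrete, here is a rough sketch of the “scrape the page for the number” step. Everything in it is an assumption for illustration: the function name `extract_count`, the tiny stop-word list, and the made-up page about a fictional director. It simply returns the first integer found in a sentence that shares a keyword with a “how many” question, which is about as shallow as the approach described above.

```python
import re

# Hypothetical stop words; a real system would need a far better list.
STOP_WORDS = {"how", "many", "has", "have", "the", "a", "of", "are", "is"}

def extract_count(question, page_text):
    """Return the first integer in a sentence that shares a keyword
    with a 'how many' question, or None if nothing matches."""
    if not question.lower().startswith("how many"):
        return None
    keywords = {w for w in re.findall(r"[a-z]+", question.lower())
                if w not in STOP_WORDS}
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", page_text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        number = re.search(r"\b\d+\b", sentence)
        if number and keywords & words:
            return int(number.group())
    return None

# Made-up page text about a fictional person, purely for illustration.
page = "Jane Doe is a fictional director. She has directed 7 movies."
print(extract_count("How many movies has Jane Doe directed?", page))
```

Note how fragile this is: a birth year or a box-office figure in the wrong sentence would be returned just as happily, which is part of why this kind of scraping is only a small step forward.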

Now here is another scenario. The user poses the question: How many Starbucks restaurants are between my house and my office? The first thing to note is that, in all likelihood, no web page containing the answer we seek exists anywhere. It is also unlikely that there is a “deep Web” database somewhere with a row in it containing the needed information. But it is highly likely that all the needed information is in fact available online. Certainly we could find the route from my house to the office using a mapping web site. And we could find the addresses of Starbucks locations in the area. But what we need is more than information retrieval. We need information synthesis. Answering the question requires some computation. The hard question we’d like to answer is: can a search engine be smart enough to perform the needed computations (or outsource them) and then return the result?
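
Here is a minimal sketch of what that synthesis step might look like once the two pieces of information have been retrieved. It assumes the route arrives as a list of (latitude, longitude) points and the store locations as another such list; it also assumes “between” means “within some distance of the route” (200 m by default). The function names and the coordinates are hypothetical, and the flat-earth projection is only reasonable over city-sized distances.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude

def to_xy(lat, lon, ref_lat):
    """Project degrees to approximate local x/y kilometres."""
    x = lon * KM_PER_DEGREE * math.cos(math.radians(ref_lat))
    y = lat * KM_PER_DEGREE
    return (x, y)

def dist_to_segment(p, a, b):
    """Distance in km from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection of p onto the segment to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def count_along_route(route, places, max_km=0.2):
    """Count places lying within max_km of any leg of the route."""
    ref_lat = route[0][0]
    path = [to_xy(lat, lon, ref_lat) for lat, lon in route]
    legs = list(zip(path, path[1:]))
    count = 0
    for lat, lon in places:
        p = to_xy(lat, lon, ref_lat)
        if any(dist_to_segment(p, a, b) <= max_km for a, b in legs):
            count += 1
    return count

# Made-up coordinates: one store near the route, two far from it.
route = [(40.0, -74.0), (40.0, -73.9)]
stores = [(40.001, -73.95), (40.01, -73.95), (40.0, -73.8)]
print(count_along_route(route, stores))  # 1
```

The computation itself is trivial. The hard parts are everything around it: knowing that these two data sources are needed, fetching them, and realizing that this is the computation to run.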

Some challenges must be overcome to achieve this:

  • The search engine must “understand” the user’s question. As it stands today, search engines don’t really accept questions–just words. The words are string matched against an index. There are hardly any semantics associated with the query, and therefore the search engine has a very shallow understanding of what the user really wants.
  • The search engine must index more than just web pages. It must also index services that can perform computations. The search engine must also understand how the services work, most likely by having a description of the service interface. Alternatively, the search engine could somehow outsource the finding of the needed service. (UDDI could be considered a rudimentary version of this, but it is a registry-based technology where the service provider must actively register the service. Instead, services should be “discovered” dynamically by the search engine so that a massive index can be built just like the index of static HTML pages. Obviously UDDI has not really caught on. When was the last time you searched a UDDI registry?)
  • If the answer to the query is not found on a static web page, but requires a service invocation, computational resources must be allocated for the service to run. The search engine itself may take responsibility for assigning resources, or the service request could be floated into the “cloud,” where processing would be assigned in a distributed fashion and the result returned asynchronously when it has been computed. (A side issue is the ability to estimate the computational cost of answering the question. Lower-cost questions could be answered quickly, perhaps by the search engine itself. Higher-cost questions would require more resource allocation, and the result may not be returned for some time. Good estimation is essential here.)
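
The last two challenges can be sketched together. The toy below assumes each indexed service comes with a description of its interface (what it needs, what it produces) plus a rough cost estimate; the index chains services backward from the desired answer to the facts already known, then decides whether to answer inline or defer based on total estimated cost. The class names, service names, cost units, and budget are all hypothetical; this is not how UDDI or any real search engine works.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceDescription:
    name: str
    inputs: frozenset   # kinds of data the service needs
    output: str         # kind of data the service produces
    est_cost: float     # rough cost estimate, arbitrary units

class ServiceIndex:
    def __init__(self):
        self._by_output = {}

    def add(self, svc):
        self._by_output.setdefault(svc.output, []).append(svc)

    def plan(self, goal, known, _visiting=frozenset()):
        """Chain backward from goal to known facts.
        Returns services in invocation order, or None if no chain exists."""
        if goal in known:
            return []
        if goal in _visiting:  # guard against circular dependencies
            return None
        for svc in self._by_output.get(goal, []):
            steps, ok = [], True
            for needed in sorted(svc.inputs):
                sub = self.plan(needed, known, _visiting | {goal})
                if sub is None:
                    ok = False
                    break
                steps += [s for s in sub if s not in steps]
            if ok:
                return steps + [svc]
        return None

def dispatch(steps, budget=10.0):
    """Answer cheap plans inline; defer expensive ones for async results."""
    total = sum(s.est_cost for s in steps)
    return ("inline" if total <= budget else "deferred", total)

index = ServiceIndex()
index.add(ServiceDescription("directions", frozenset({"origin", "destination"}), "route", 2.0))
index.add(ServiceDescription("store_locator", frozenset({"brand"}), "store_locations", 1.0))
index.add(ServiceDescription("count_near_route", frozenset({"route", "store_locations"}), "store_count", 0.5))

steps = index.plan("store_count", known={"origin", "destination", "brand"})
print([s.name for s in steps])  # ['directions', 'store_locator', 'count_near_route']
print(dispatch(steps))          # ('inline', 3.5)
```

Of course, this skips the first challenge entirely: something still has to turn the user’s question into the goal "store_count" and the known facts in the first place.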

About rsdunlapiv

Computer science PhD student at Georgia Tech

One response to “Beyond the Deep Web”

  1. Matthew Theobald says :

    You might check out isen.org about a technology to systematically catalog deep web interfaces.
