Scalability of Semantic Search on the Web

December 4th, 2008 by Dr. Riza C Berkan, CEO

Let’s start with an analogy.

If you have ever had a flat tire, you know what it takes to change one. Using the standard equipment available in the trunk of any car, the job takes anywhere from 15 to 30 minutes and requires minimal knowledge.

In Formula 1 races, changing a tire takes under 8 seconds. It requires high-tech equipment engineered specifically for this task, and trained professionals to do it fast.

The difference between semantic technology in general and a semantic technology that can become a Web search engine is very much like the difference between changing a flat tire on an ordinary car and changing one on a Formula 1 race car.

Powerset’s limited coverage (Wikipedia only) was a recent example that raised awareness of the scalability issue among technology-savvy readers. Without overcoming the scalability challenge, a semantic technology cannot become a Web application, nor can it become a solution in enterprises handling vast amounts of documents.

Compared to conventional indexing search engines (with a popularity flavor), a semantic search engine carries an extra load, because semantic algorithms do much more than indexing search engines do, both at the back end and on the fly. The extra load consists of the concept maps (ontology) and the lexicon that are necessary to analyze Web pages as well as incoming queries.

Consequently, if you have an indexing search engine, it becomes a nightmare to add the extra “semantic” load on top of indexing. This is the basic reason why conventional search engines like Google cannot be easily converted into semantic search engines unless the entire infrastructure is redesigned from the bottom up.

At hakia, we have engineered a solution that diminishes the extra “semantic” load by reinventing the indexing operation so that it is suitable for semantic operations. It is called QDEXing, which stands for “query detection and extraction.”

QDEX is not a table of words versus document IDs; rather, it is a table of extracted queries (word sequences) versus paragraph IDs. It can also be viewed as the decomposition of text into its most meaningful knowledge sequences. Once such a decomposition is done accurately, there is no need for a table-like index and no need for taking intersections. The extra load evaporates via direct access gateways.
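To make the contrast concrete, here is a minimal sketch in Python. It is not hakia's implementation: the toy paragraphs, the function names, and especially `extract_sequences` (which simply takes every contiguous word n-gram up to three words) are assumptions for illustration. A real extractor would use semantic analysis to keep only the sequences that make sense. The first half answers a query by intersecting per-word posting sets, as a conventional index does; the second half looks the whole query up directly in a table of sequences versus paragraph IDs.

```python
from collections import defaultdict

# Toy corpus, invented for illustration: paragraph ID -> text.
PARAGRAPHS = {
    1: "changing a flat tire takes fifteen minutes",
    2: "a race car pit crew can change a tire in seconds",
    3: "a flat tire on a race car ends the race",
}


def build_inverted_index(paragraphs):
    """Conventional index: each word maps to the set of paragraph IDs containing it."""
    index = defaultdict(set)
    for pid, text in paragraphs.items():
        for word in text.split():
            index[word].add(pid)
    return index


def search_inverted(index, query):
    """Answer a query by intersecting the posting sets of its words."""
    postings = [index.get(word, set()) for word in query.split()]
    return set.intersection(*postings) if postings else set()


def extract_sequences(text, max_len=3):
    """Hypothetical stand-in extractor: all contiguous n-grams up to max_len words."""
    words = text.split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def build_sequence_table(paragraphs, max_len=3):
    """Sequence table: each extracted word sequence maps to paragraph IDs."""
    table = defaultdict(set)
    for pid, text in paragraphs.items():
        for seq in extract_sequences(text, max_len):
            table[seq].add(pid)
    return table


def search_sequences(table, query):
    """Answer a query with a single direct lookup; no intersection step."""
    return table.get(query, set())


inverted = build_inverted_index(PARAGRAPHS)
sequences = build_sequence_table(PARAGRAPHS)

print(search_inverted(inverted, "flat tire"))    # paragraphs containing both words
print(search_sequences(sequences, "flat tire"))  # paragraphs containing the sequence
print(search_inverted(inverted, "race tire"))    # matches on co-occurrence alone
print(search_sequences(sequences, "race tire"))  # empty: never extracted as a sequence
```

The trade-off in this sketch is that the work moves from query time to indexing time: the sequence table is larger and its quality depends entirely on which sequences the extractor chooses to keep.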

Extracting query sequences from a typical Web page (500 words) is the key component in QDEXing. Normally, up to 1,000 query sequences that make sense by human inspection could be extracted from a Web page. However, the permutation space with 500 words is huge, and there could be billions of possible sequences. The challenge is for a computer algorithm to find the 1,000 that make sense out of those billions. That is where the semantic technology is heavily used.
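For a rough sense of scale, the short Python sketch below counts candidate sequences for a 500-word page under three different assumptions. The post does not say how the candidate space is defined, so the three interpretations and the function names are illustrative guesses, not hakia's definition.

```python
import math


def contiguous_spans(n_words):
    """Number of contiguous word runs of any length: n * (n + 1) / 2."""
    return n_words * (n_words + 1) // 2


def order_preserving_selections(n_words, length):
    """Ways to pick `length` word positions while keeping page order."""
    return math.comb(n_words, length)


def reordered_selections(n_words, length):
    """Ways to pick `length` word positions in any order."""
    return math.perm(n_words, length)


PAGE_WORDS = 500  # the typical page size used in the post

print(contiguous_spans(PAGE_WORDS))                # 125250
print(order_preserving_selections(PAGE_WORDS, 4))  # 2573031125
print(reordered_selections(PAGE_WORDS, 4))         # 61752747000
```

Under these assumptions, contiguous runs alone number only about 125 thousand, but allowing four-word sequences built from non-adjacent words already gives more than 2.5 billion candidates, which is the order of magnitude described above.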

QDEXing is also the basis of identifying meaningful keywords for advertising applications as I explained in the previous blog article.

The scalability issue does not stop with solutions to the back end. The second challenge is analyzing the incoming query on the fly and ranking the paragraphs retrieved from the QDEX. We have developed the SemanticRank algorithm just for this purpose, but I will talk about this in another blog post.

Our scalable solution has enabled us to QDEX credible Web pages, which appear in a separate column of our search results. The document coverage of our QDEXing operation is growing at a favorable pace, while hakia handles queries with an average response time within 1 second.


One Response to “Scalability of Semantic Search on the Web”

  1. Sadegh Kharazmi Says:

    I want to know whether QDEXing uses a two-layer indexing system or two indexes integrated with each other.
    If QDEX extracts the meaningful sentences, which data structure does it use to index them?
