Proper Testing of a Semantic Search Engine: Part-2 Evaluations
In my earlier post on this subject, I outlined what is required to conduct a proper test of a semantic search engine. My red-line comment was “you cannot test a semantic search engine using a dozen queries.” My estimate was that a proper test set must include at least 220 to 330 queries, if not more, depending on the objective. Now I want to talk about the other side of the coin: how to evaluate the results.
A recent article titled Search Engine Fatigue is full of new information about the low satisfaction levels with conventional search engines. Studies of this nature are often to the point since they mainly focus on end-user experience and satisfaction. Sample size and the distribution of the end users are the only parameters that determine whether such a study is meaningful.
Questionable reports arise out of single-point evaluations, where the evaluator is one person or a small group of people. They undertake the holy task of evaluating search performance on the assumption that their single-point experience represents average behavior or perception. I am writing this blog post mainly for these kinds of evaluators. So my starting red-line comment is: you cannot evaluate a semantic search engine using the usual criteria applied to keyword-indexing search engines, because that is comparing apples and oranges. Here is how to approach this task carefully:
Evaluation of the Search Results Page
The decades-old idea of “document searching” is no longer an acceptable norm for the future of Web search. In the realm of semantic search, what is offered is a 1-step search, not a 2-step search (where the 2nd step is searching inside the documents). Since keyword-indexing technology cannot consistently bring coherent sentences to the results page, this criterion has historically never been on the agenda of evaluators, including institutions like TREC, until recently. For example, look at a typical Google result below for the query “what does it mean to cross the rubicon?”
This result provokes the user to open the document for the 2nd search, which sends the user on a trip of indefinite length. Instead, the result snippet itself must show clearly, at first glance, whether the destination contains accurate information. Here is an example from hakia BETA:
Raising the expectations in this manner is not a cosmetic choice of a new technology builder. Semantic search is not semantic search if this level of 1-step clarity is not present.
Relevancy
The clarity of perception described above relates directly to relevancy. Again, in the vision of 1-step search, the relevancy of a search result is different from the relevancy of the document it points to. If a search engine offers no 1-step search relevancy, then we are not talking about semantic search.
There are many tricks in the conventional search paradigm to fool the user into thinking that there is a 1-step relevancy in the search results page. For example, see the Google result for the query “what proteins are useful for the body?”
Here the query is replicated in a result that points to another search page (the 2nd step). Even in the 2nd step, there is no 1-step relevancy, which pushes the user toward a 3rd step. You can see that by clicking on the Google result above and counting your clicks. A single mouse click costs you only 0.5 seconds and 0.01 calories, but each Web page it opens costs about 3 minutes of reading (assumed to be the average). With the 2nd search, you are forced to read the documents or Web pages one after another (3 min + 3 min + 3 min +…), which can take substantial time. That is the main reason for Search Engine Fatigue.
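To make the arithmetic above concrete, here is a minimal sketch that adds up the time spent when each click opens a page that must be read. The 0.5 seconds per click and 3 minutes per page are the figures assumed in this post; the function name search_cost_seconds is hypothetical and only for illustration.

```python
# Figures taken from the paragraph above; both are rough assumptions.
CLICK_SECONDS = 0.5              # time for one mouse click
READ_SECONDS_PER_PAGE = 3 * 60   # assumed average reading time per Web page


def search_cost_seconds(pages_opened: int) -> float:
    """Total time (in seconds) spent clicking on and reading `pages_opened` pages."""
    if pages_opened < 0:
        raise ValueError("pages_opened must be non-negative")
    return pages_opened * (CLICK_SECONDS + READ_SECONDS_PER_PAGE)


if __name__ == "__main__":
    # 0 pages corresponds to a 1-step search answered on the results page itself.
    for pages in (0, 1, 3):
        minutes = search_cost_seconds(pages) / 60
        print(f"{pages} page(s) opened: about {minutes:.1f} minutes")
```

The point of the sketch is that the click itself is negligible; the reading time dominates and grows with every page the user is forced to open.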
Freshness of Results
Conventional search engines want you to think NEWS is just another category tucked away, possibly offered as a 2nd step. There is a reason for that. Search engines that rely on collecting statistics (like Google’s PageRank algorithm) have no time to collect linking votes for news content, because fresh pages have not yet accumulated links. Therefore, they cannot leverage their full search potential in such cases. But in real life, information that emerged today has the utmost value. Think about prices, politics, the economy, medical news, etc. Semantic search engines can offer their full performance on news content because they don’t rely on collecting statistics. Thus you can see news results more often on the front page, with 1-step relevancy. When comparing a conventional search engine to a semantic search engine, NEWS and dynamic pages must have the highest priority in the evaluation and must not be treated as a separate categorical search.
Coverage
Semantic search engines are more likely to offer an abundance of search results related to the different aspects of the search term. This is due to the underlying semantic map that they must implement. For example, a comparison of the results for the search term piano on Google and hakia BETA illustrates this property. Accordingly, comparative evaluations must take into account how well the results cover the aspects of the search term, always with the vision of 1-step search relevancy.
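One simple way an evaluator could put a number on coverage is sketched below. It assumes the evaluator has already listed the aspects of the search term they expect to see and has labeled each result with the aspect it addresses. The function name aspect_coverage and the aspect labels for piano are hypothetical and chosen only for illustration; they are not taken from either search engine.

```python
def aspect_coverage(expected_aspects, result_aspects):
    """Fraction of the expected aspects that appear at least once in the results."""
    expected = set(expected_aspects)
    if not expected:
        raise ValueError("expected_aspects must not be empty")
    found = expected & set(result_aspects)
    return len(found) / len(expected)


if __name__ == "__main__":
    # Hypothetical aspects an evaluator might expect for the query "piano".
    expected = ["instrument", "lessons", "sheet music", "history"]

    # Hypothetical labels assigned by the evaluator to the first-page results.
    engine_a = ["instrument", "instrument", "lessons", "instrument"]
    engine_b = ["instrument", "lessons", "sheet music", "history"]

    print("engine A coverage:", aspect_coverage(expected, engine_a))
    print("engine B coverage:", aspect_coverage(expected, engine_b))
```

A measure like this only counts whether an aspect is present; it says nothing about 1-step relevancy, which still has to be judged on each result snippet.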
Conclusion
All those who would like to evaluate semantic search engines against conventional search engines must take the considerations above into account. In short, 1-step search relevancy, freshness, and coverage are the three main criteria for the future of Web search, and the antidote to Search Engine Fatigue. One can add “credibility of the search results” to the equation, which is self-explanatory.
