Proper Testing of a Semantic Search Engine: Part-1 The Query Set
In response to the current interest in Semantic Search Engines (SSEs), new debates are emerging in the market as to how good SSEs are, or how good they can get. That brings us to the subject of testing. This is a long subject, so I have divided these considerations into several blog entries. This one focuses on preparing an appropriate query set for testing.
I am seeing a number of attempts to evaluate SSEs. I am alarmed by how the testing can be trivialized so quickly without proper guidance, or the required background knowledge. I must immediately start like this:
You CANNOT test a Semantic Search Engine using a DOZEN QUERIES!
Using a handful of queries means that the tester or evaluator is not aware of the combinatorial space of all applicable considerations. In such cases, there is always an underlying favoritism, supported by convincing arguments built around the selected examples.
Here is the “absolute-minimum” short list of considerations to test SSEs:
QUERY TYPE:
A proper “test” case must include all possible variations of a query/question as listed in the table below. The column called “sampling” indicates the minimum number of cases to be tested for each variation.
| Table-1: Query Types | Sampling |
| keyword, phrase, sentence, what, where, when, how, why, which, who, is/was/does | 11+ |
These variations test whether the search engine is sensitive to different aspects of the requested information. The minimum sampling is 11+ where the + indicates more variations like how much, how many, and whose. A scientific analysis can include more than 100 types of questioning patterns in English.
QUERY LENGTH:
Each one of the 11+ queries must be tested for different query lengths. The length of a query can be counted as the number of significant words after noise elimination.
| Table-2: Query Length | Sampling |
| 1, 2, 3, 4, 5, 6, more than 6 | 7 |
This is a very important spectrum. Queries (of any type in Table-1) with 1 or 2 significant words are considered “general” concept questions, whereas queries with 3 or more significant words are specific questions entering the “long-tail” section. Thus, the testing sample has already increased to 11 x 7 = 77. However, if you must shorten it, you should at least sample the fat-tail (1 or 2) versus long-tail (3 or more) queries. This would put the permutation of cases at 11 x 2 = 22.
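To make the length measure concrete, here is a minimal sketch of counting significant words and assigning a query to a length bucket. The noise-word list and the function names are assumptions made for illustration only (a real engine would use its own noise elimination), and the sketch assumes the long-tail cut-off sits at three significant words.

```python
import re

# Hypothetical noise-word list, for illustration only.
NOISE_WORDS = {
    "a", "an", "the", "of", "in", "on", "to", "for",
    "is", "was", "does", "do", "did",
    "what", "where", "when", "how", "why", "which", "who",
}

def significant_length(query: str) -> int:
    """Count the words left in a query after noise elimination."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return sum(1 for word in words if word not in NOISE_WORDS)

def table2_bucket(query: str) -> str:
    """Map a query to one of the seven Table-2 length classes."""
    n = significant_length(query)
    return str(n) if n <= 6 else "more than 6"

def tail_bucket(query: str) -> str:
    """Coarse two-way split: fat-tail versus long-tail (3 or more words)."""
    return "long-tail" if significant_length(query) >= 3 else "fat-tail"
```

With this list, “when did the Roman Empire fall?” keeps three significant words (roman, empire, fall), so it lands in the long-tail bucket.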
CONTENT TYPE:
To do a quick job, you can compile 22 queries as outlined above, but in what subject? You have to cover a variety of subjects because semantic search capability can be more effective in one subject compared to another.
| Table-3: Content Type | Sampling |
| medicine, law, politics, entertainment, sports, shopping, tourism, computers, science, education | 10 |
In the most general case, there can be 10 different content areas for SSE testing. This puts the minimum number of sampling questions at 11 x 2 x 10 = 220.
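The arithmetic above can be sketched as a Cartesian product of the three dimensions. The lists are taken from Table-1 and Table-3 and the two length buckets; the dictionary layout and function name are assumptions for illustration. Each cell still needs an actual query written for it.

```python
from itertools import product

QUERY_TYPES = [
    "keyword", "phrase", "sentence", "what", "where", "when",
    "how", "why", "which", "who", "is/was/does",
]
LENGTH_BUCKETS = ["fat-tail", "long-tail"]
CONTENT_TYPES = [
    "medicine", "law", "politics", "entertainment", "sports",
    "shopping", "tourism", "computers", "science", "education",
]

def build_test_matrix():
    """Return one test-case slot per (query type, length, content) combination."""
    return [
        {"query_type": qtype, "length": length, "content": content}
        for qtype, length, content in product(
            QUERY_TYPES, LENGTH_BUCKETS, CONTENT_TYPES
        )
    ]

matrix = build_test_matrix()
print(len(matrix))  # 11 x 2 x 10 = 220
```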
SENSE DISAMBIGUATION:
The 220 test queries as suggested above do not include sense disambiguation tests. This would require another set of queries. Among the number of ways of compiling such a test set, the shortest would require examining the 220 queries and focusing on their equivalent articulations. For example, “when did the Roman Empire fall?” can be articulated as “when did the Roman empire collapse?” The sense of “fall” in the first query is the same as “collapse” in the second query. If both queries bring overlapping results, it means that the SSE is able to detect the right sense of the word.
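One way to quantify “overlapping results” is to compare the top results returned for the two articulations. The sketch below uses Jaccard overlap over the top k results; this metric, the cut-off k, and the function name are my assumptions for illustration, not a prescribed measure.

```python
def result_overlap(results_a, results_b, k=10):
    """Jaccard overlap between the top-k results of two equivalent queries.

    results_a and results_b are ranked lists of result identifiers
    (for example URLs). Returns a value between 0.0 and 1.0.
    """
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    union = top_a | top_b
    if not union:
        return 0.0  # neither query returned anything
    return len(top_a & top_b) / len(union)

# Hypothetical result lists for "fall" versus "collapse":
fall_results = ["url1", "url2", "url3"]
collapse_results = ["url2", "url3", "url4"]
print(result_overlap(fall_results, collapse_results))  # 0.5
```

A high score suggests that the engine treats the two wordings as the same sense; where to set the pass threshold is a judgment call for the evaluator.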
For the sake of argument, if we assume that none of the fat-tail queries (1 or 2 significant words) include an event (verb), then we could at least double the rest of the queries, that is, the 110 long-tail queries that make up half of the 220, using equivalent articulations. This would require 110 additional queries, bringing the total to 330. Note that this is a very limited test of the depth of the semantic capabilities.
The conclusion of the first part is this: testing a semantic search engine requires at least 330 queries just to scratch the surface. hakia’s internal fitness tests, for example, use a couple of thousand queries. Therefore, if you see any report or article about the evaluation of a search engine using a dozen queries, even if it includes valuable insight, it will tell you nothing about the overall state of that search engine.
Preparing test cases is only half of the equation. How to evaluate the results is a whole different story. I will post part-2 to discuss this matter soon.
August 13th, 2007 at 6:15 am
That’s good info for all those who were questioning the *haki* usage and its edge over other search engines.
Hopefully people get answers from this.
August 13th, 2007 at 8:31 pm
Interesting. Great summary and intro to testing semantic search. Testing is important, and it seems you all are taking it seriously, and approaching it scientifically. I’ve been using Hakia.com for all my searches over the past few weeks (having just discovered it), and I can tell that you do take your work (and testing) seriously. I’ve been generally pleased at the results of my queries.
August 14th, 2007 at 5:41 am
Testing for regular keyword search engines (e.g. Google) is already a non-trivial matter involving a gazillion queries. Semantic search, being more complex, naturally demands an even more complex test set. Thanks for sketching out the issues involved. I have a blog post pointing to this as well as to Powerset’s first public demo. For all the natural language/semantic search startups, the biggest test will be the public users, and that’s something no test set can substitute.
August 17th, 2007 at 1:52 am
I agree with this completely, thanks for the post.
August 31st, 2007 at 5:20 am
Hey guys,
Sorry it’s been a while since my last post, but things have been a little hectic as of late (well for a while now), but I have been keeping track and testing hakia as much as I can.
In fact, this is kinda what this post is all about – basically, as I’ve been using hakia, I’ve come across a few searches that just didn’t look quite right. Yes, I know this whole post is about the process of scientifically testing hakia, so quick disclaimer, I’m just one man who happened to stumble upon a few dodgy looking results in the process of my normal searching, lol.
Anyways, I’ll start by mentioning the results you get for typing in physorg (http://www.hakia.com/search.aspx?q=physorg). As far as I know, physorg is a pretty reputable news site, but while there are often a couple of articles marked as news from within the site, I can’t find any hint of its homepage within the first couple of search result pages. In fact, while the first unmarked page (e.g. a page not denoted by the ‘news’ thing) is seemingly from within the site, it takes you to a 404 ‘webpage cannot be found’ frame.
Slightly similar are the results found for ‘searchmash’ (http://www.hakia.com/search.aspx?q=searchmash) (google’s not so secret secret test site). Basically, the top result you are presented with is from a page within the site (the privacy policy), with searchmash itself (the homepage) only turning up as the fifth or sixth result (dependent on whether you count the news article that was there).
The same could be said for yahoo’s alpha site (yahoo’s own little test site – search term, ‘alpha yahoo’ http://www.hakia.com/search.aspx?q=alpha+yahoo), which again doesn’t seem to rate anywhere within at least the first couple of pages (even though it rates as either the first, or within the first couple of results in both google and yahoo for the same search term).
Now, while I still have at least one more search to mention, I’ll take a break to say why the last few looked a little strange to me. This is because surely when entering the actual name of a website, I’d have just assumed that both the most ‘credible’ and ‘popular’ result for that site would be its homepage, with any other pages, or articles from or about that site, subsequently rating below that.
Anyways, my next one is something that might prove to be a little embarrassing for you guys. When you enter hakia to get its related gallery (http://www.hakia.com/search.aspx?q=hakia), the link to your own website within the ‘official website’ category, states: ‘BETA 12, November 2006, Full Operation will Debut 2007’. Errrr, guys just to let you know, but today’s date happens to be the 31st August 2007 and last time I checked you were on beta 15, lol. Compare that to the google results, which display more or less the same info, but showing the current beta.
Anyways, to avoid this post getting any longer I think I’ll probably end it there, but my overall question here is whether you could look into the results I’ve mentioned, because they just don’t look right to me.
Thanks for your time and keep up with the good work.
September 3rd, 2007 at 5:31 pm
Hi again guys,
Since I got such a resounding response to my last query (just kidding, lol), I figured I’d mention a couple of other searches which I’ve noticed along the way:
Firstly, typing in the phrase, ‘world community grid’ (a grid computing effort established by ibm, for humanitarian efforts – http://www.worldcommunitygrid.org/) (http://www.hakia.com/search.aspx?q=world+community+grid), seemingly refuses to bring up its homepage anywhere within at least the first couple of pages. In fact, your first result appears to be the wikipedia entry for it – I ask you again, surely the single most relevant entry when you have entered the actual name of a site should always be the homepage of the site in question.
My next couple of queries are a little more random and you might need to bear with me as I explain them quickly, these are ‘e.t’ and ‘a.i’. Now, let me start with e.t – the odd thing here is that if you enter it without the dot (so in other words just et – http://www.hakia.com/search.aspx?q=et), hakia is seemingly unable to produce any kind of results for this, while just putting in the dot (http://www.hakia.com/search.aspx?q=e.t) takes you to the gallery for e.t the film. What makes it even stranger, is that this process seems to happen in reverse for a.i (in other words typing it on its own gives you results, while having the dot in it will give you none). Now, you may be wondering why I’m making such a fuss over this, but I just found it really disappointing that all this NLP and A.I that you keep boasting about, can seemingly be thrown by simply adding or removing a dot – to the point where it is apparently unable to make even a remote guess at the user’s intentions. This is in stark contrast to a comparison against some of the other search engines out there – in this case I tested both google and yahoo – which both brought up what appear to be (at least broadly) relevant results in both queries – with or without the dot. In fact, for many of the queries that I have mentioned (both here and in my last post), I’d say that these ‘mere’ keyword engines have come off looking at least somewhat better.
Anyways, please don’t take either of these last posts as me trying to embarrass you or anything, I merely felt you’d be interested in hearing about a few searches that (from what I can see at least) are not up to your usual levels. To that end, is there any chance of hearing anything back on them – it would be nice to hear they’re at least being looked into.
Thanks.