In response to the current interest in Semantic Search Engines (SSEs), new debates are emerging in the market about how good SSEs are and how good they can get. That brings us to the subject of testing. It is a long subject, so I have divided these considerations into several blog entries. This one focuses on preparing an appropriate query set for testing.
I am seeing a number of attempts to evaluate SSEs. I am alarmed by how quickly testing can be trivialized without proper guidance or the required background knowledge. I must start with this:
You CANNOT test a Semantic Search Engine using a DOZEN QUERIES!
Using a handful of queries means that the tester and/or evaluator is not aware of the combinatorial space of all applicable considerations. In such cases, there is always an underlying favoritism, backed by convincing arguments built around the selected examples.
Here is the “absolute-minimum” short list of considerations to test SSEs:
QUERY TYPE:
A proper “test” case must include all possible variations of a query/question as listed in the table below. The column called “sampling” indicates the minimum number of cases to be tested for each variation.
| Table-1: Query Types | Sampling |
| --- | --- |
| keyword, phrase, sentence, what, where, when, how, why, which, who, is/was/does | 11+ |
These variations test whether the search engine is sensitive to different aspects of the requested information. The minimum sampling is 11+ where the + indicates more variations like how much, how many, and whose. A scientific analysis can include more than 100 types of questioning patterns in English.
QUERY LENGTH:
Each one of the 11+ queries must be tested for different query lengths. The length of a query can be counted as the number of significant words after noise elimination.
| Table-2: Query Length | Sampling |
| --- | --- |
| 1, 2, 3, 4, 5, 6, more than 6 | 7 |
This is a very important spectrum. Queries (of any type in Table-1) with 1 or 2 significant words are considered “general” concept questions, whereas queries with 3 or more significant words are specific questions entering the “long-tail” section. Thus, the testing sample has already increased to 11 x 7 = 77. However, if you must shorten it, you should at least sample the fat-tail (1 or 2) versus long-tail (3 or more) queries. This would put the number of cases at 11 x 2 = 22.
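As a rough illustration of counting significant words and sorting a query into a length bucket, here is a minimal Python sketch. The noise-word list is a tiny hypothetical one (it also treats question words as noise, which is my assumption), and the boundary is assumed to be 1 or 2 significant words for fat-tail and 3 or more for long-tail. The names `NOISE_WORDS`, `significant_words`, and `length_bucket` are made up for this example.

```python
import re

# Hypothetical, deliberately small noise-word list (an assumption for this
# sketch); a real evaluation would use a much fuller list.
NOISE_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "and", "or",
    "is", "was", "does", "did", "do",
    "what", "where", "when", "how", "why", "which", "who",
}


def significant_words(query):
    """Return the lowercase tokens that survive noise elimination."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [token for token in tokens if token not in NOISE_WORDS]


def length_bucket(query):
    """Classify a query by its number of significant words.

    Assumed boundary: 1 or 2 words is fat-tail, 3 or more is long-tail.
    """
    count = len(significant_words(query))
    if count == 0:
        return "empty"
    return "fat-tail" if count <= 2 else "long-tail"


print(significant_words("when did the Roman Empire fall?"))
print(length_bucket("Roman Empire"))
print(length_bucket("when did the Roman Empire fall?"))
```

With this word list, the Roman Empire question keeps three significant words and lands in the long-tail bucket, while the bare phrase stays in the fat-tail bucket.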
CONTENT TYPE:
To do a quick job, you can compile 22 queries as outlined above, but in what subject? You have to cover a variety of subjects because semantic search capability can be more effective in one subject compared to another.
| Table-3: Content Type | Sampling |
| --- | --- |
| medicine, law, politics, entertainment, sports, shopping, tourism, computers, science, education | 10 |
At the most general level, there can be 10 different content areas for SSE testing. This brings the minimum number of sample queries to 11 x 2 x 10 = 220.
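To make the combinations concrete, the three tables can be crossed mechanically. The following is only a sketch: it enumerates the empty "slots" a tester would still have to fill with real queries, and the names `build_matrix`, `LENGTH_BUCKETS`, and the dictionary keys are my own choices for illustration.

```python
from itertools import product

# Values taken from Table-1, Table-2, and Table-3.
QUERY_TYPES = [
    "keyword", "phrase", "sentence", "what", "where", "when",
    "how", "why", "which", "who", "is/was/does",
]
ALL_LENGTHS = ["1", "2", "3", "4", "5", "6", "more than 6"]
LENGTH_BUCKETS = ["fat-tail", "long-tail"]  # the shortened two-way split
CONTENT_TYPES = [
    "medicine", "law", "politics", "entertainment", "sports",
    "shopping", "tourism", "computers", "science", "education",
]


def build_matrix(query_types, lengths, content_types):
    """Return one slot per (query type, length, content type) combination."""
    return [
        {"query_type": q, "length": n, "content_type": c}
        for q, n, c in product(query_types, lengths, content_types)
    ]


short_matrix = build_matrix(QUERY_TYPES, LENGTH_BUCKETS, CONTENT_TYPES)
full_matrix = build_matrix(QUERY_TYPES, ALL_LENGTHS, CONTENT_TYPES)

print(len(short_matrix))  # 11 x 2 x 10 = 220
print(len(full_matrix))   # 11 x 7 x 10 = 770
```

Keeping the slots as plain dictionaries makes it easy to attach the actual query text later and to check that no combination has been left without a query.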
SENSE DISAMBIGUATION:
The 220 test queries suggested above do not include sense disambiguation tests. This would require another set of queries. Of the many ways to compile such a test set, the shortest is to examine the 220 queries and focus on their equivalent articulations. For example, “when did the Roman Empire fall?” can be articulated as “when did the Roman Empire collapse?” The sense of “fall” in the first query is the same as “collapse” in the second query. If both queries bring overlapping results, it means that the SSE is able to detect the right sense of the word.
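One way to put a number on "overlapping results" is to compare the top results of the two articulations. The sketch below uses the Jaccard index over the top-k result identifiers; that choice of measure, the `top_k` cutoff, and the document identifiers are all assumptions for illustration, not something prescribed here.

```python
def result_overlap(results_a, results_b, top_k=10):
    """Jaccard overlap between the top-k results of two queries.

    Each argument is a ranked list of result identifiers (for example URLs).
    Returns a value between 0.0 (no shared results) and 1.0 (identical sets).
    """
    top_a = set(results_a[:top_k])
    top_b = set(results_b[:top_k])
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)


# Made-up result lists for the two articulations of the same question.
fall_results = ["doc1", "doc2", "doc3", "doc4"]
collapse_results = ["doc2", "doc3", "doc4", "doc5"]

print(result_overlap(fall_results, collapse_results))  # 3 shared / 5 total = 0.6
```

A set-based measure ignores ranking order, so it only says whether the two articulations retrieve the same material, not whether they rank it the same way.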
For the sake of argument, if we assume that none of the fat-tail queries (1 or 2 significant words) include an event (verb), then we could at least double the rest of the queries using equivalent articulations. The rest are the long-tail half of the 220 queries, so this would require 110 additional queries, bringing the total to 330. Note that this is a very limited test of the depth of the semantic capabilities.
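The running total can be checked with a few lines of arithmetic. This sketch assumes the shortened two-bucket split, with the fat-tail and long-tail buckets sharing the base set equally; the function and parameter names are hypothetical.

```python
def minimum_query_count(query_types=11, length_buckets=2, content_types=10):
    """Tally the minimum test set, including sense disambiguation pairs."""
    base = query_types * length_buckets * content_types
    # Assumption: the buckets split the base set evenly, and only the
    # long-tail bucket gets one equivalent articulation per query.
    long_tail = base // length_buckets
    return {
        "base": base,
        "extra_articulations": long_tail,
        "total": base + long_tail,
    }


print(minimum_query_count())
# {'base': 220, 'extra_articulations': 110, 'total': 330}
```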
The conclusion of the first part is this: testing a semantic search engine requires at least 330 queries just to scratch the surface. hakia’s internal fitness tests, for example, use a couple of thousand queries. Therefore, if you see any report or article that evaluates a search engine using a dozen queries, even if it includes valuable insight, it will tell you nothing about the overall state of that search engine.
Preparing test cases is only half of the equation. How to evaluate the results is a whole different story. I will post part-2 to discuss this matter soon.