Everything You Always Wanted to Know About Semantic Search, But Were Afraid to Ask (in SemTech Conferences)
In the wake of the SemTech09 conference, I thought this title would do justice to those mischievous readers who happened to have the good fortune to stumble across this blog posting. The conference was great: neatly organized and carefully secluded in San Jose, California. One of the highlights was the Semantic Search Keynote Panel with all the players on stage (Ask, Bing, Google, hakia, TrueKnowledge and Yahoo!) as seen in the picture below.
Bear in mind that semantic technology can be as heavy and stifling to “any” audience as stem-cell research can be to high-school students. Thanks to Carla Thompson from Guidewire, who did a terrific job coming up with discussion topics and moderating the panel, everyone survived the ordeal without any sign of dozing.
Despite the positive outcome, some responses from the panelists made me wonder if we should go back to the basic question of “What is semantic search?” Or, better to discuss: what is NOT semantic search? Here is my list:
Structured data. Folks, structured data is NOT semantic technology. A database that can pull out a list of beer brands, their manufacturers, and their contact information, given the query “social drinking”, has nothing to do with semantics. I say this because some people seemed to be under the illusion that there must be some kind of semantic technology at work if a search engine shows such structured data in the SERP. It is a trick as old as the ancient Egyptians, who used beads on strings to organize harvesting information. Organized information is not semantics.
Morphology. If a search engine is robust (brings the same results) to the query “top ten” versus “top 10” by recognizing “ten=10”, it would be a stretch of the imagination to call it semantic. Anyone can come up with such a replacement list without a drop of linguistic knowledge. Similarly, distinguishing the name Fisher from the noun fisher by detecting the capitalization of the first letter does not go beyond the application of simple linguistic rules. These capabilities are not semantic search capabilities.
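To show how little linguistic knowledge this takes, here is a minimal Python sketch of both tricks. The replacement table and the function names are made up for illustration and are not taken from any search engine; the capitalization rule is deliberately as crude as the one described above.

```python
import re

# Hypothetical replacement list; the entries are illustrative assumptions.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "ten": "10"}


def normalize_query(query: str) -> str:
    """Lowercase the query and swap number words for their digit forms."""
    tokens = re.findall(r"\w+", query.lower())
    return " ".join(NUMBER_WORDS.get(token, token) for token in tokens)


def looks_like_name(token: str) -> bool:
    """Crude rule: a token whose first letter is capitalized is a name."""
    return token[:1].isupper()


# Both spellings collapse to the same normalized query.
same_results = normalize_query("top ten") == normalize_query("top 10")

# The capitalization rule separates the name from the noun.
name_vs_noun = (looks_like_name("Fisher"), looks_like_name("fisher"))
```

Neither function knows anything about what “ten” or “Fisher” means; each is a lookup or a character test, which is exactly why it falls short of semantics.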
Syntax. It is true that a certain level of semantic information can be salvaged from syntax. Unfortunately, if syntax were enough to detect the meaning of text, then an 8-year-old kid who had developed a perfect reading skill (syntactically parsing strings of letters and words in English) would be expected to understand the meaning of Shakespeare’s works. The difference between reading and understanding is the difference between syntax and semantics. The former requires the skill to parse things out, whereas the latter requires a vast amount of associative knowledge.
Statistics. An infinite number of monkeys with a keyboard would eventually type the complete text of the Declaration of Independence. This is statistically correct. However, if a search engine is expected to become semantically apt using statistical algorithms, one has to wait until the monkeys finish their job. There is no place for statistics in semantics. For example, let’s take this sentence: “Polar bears don’t eat alligator eggs before dawn.” I am sure you have never seen this combination of words before in your life. But the fact that you can understand what it means is simple evidence that the semantic brain does not need statistical sampling. Meaning does not emerge from statistics. It emerges from associative knowledge.
Scalability. Scalability is the narrow bridge between science and technology. What you can carry from the science side to the technology side over this bridge determines the level of capabilities in the real world. The science of semantics is huge, stemming from the basics of philosophy. But Web search is a highly particular problem with stringent constraints (the narrow bridge). Designing semantic algorithms to drive a Web search engine is like walking on eggshells and requires a completely new approach. Therefore, a semantic algorithm can be very sophisticated, but that does not mean it is a semantic search algorithm suitable for the Web.
The five issues I addressed above explain what is NOT semantic search and should guide interested readers in questioning emerging technologies at SemTech10. Structured data, morphology, syntax, statistics, and scalability are the key questions to discuss. Obviously, contrary to what the title suggests, no one would be afraid to ask these questions, but if you understood the title, it was your semantic brain in action. That was my last example of “what is semantics” in this article.