A New Commercial Ontology from hakia
We are proud to announce our upcoming Commercial Ontology (CO), perhaps the world’s first. What is a commercial ontology? If you asked this question, you have just touched on an important distinction: fantasy versus reality. In the context of the World Wide Web, the CO is the realistic version of an ontology, for the reasons explained below.
The Realities of the Web
We have accomplished two important innovations in building the CO. First, the development of concepts and lexicons followed a strict guideline based on the realities of Web operations. What are these realities? Most search queries on the Web reflect a single dimension of intent, almost exclusively relevant to commercial topics. Note that “commercial topics” must be interpreted in the broadest sense possible. For example, if you were looking for “the benefits of foot massage” or “the director of the movie Last Emperor”, your queries fall into the same commercial pattern. One distinguishing feature of commercial queries is that they come in short packages that include a name (onomasticon) or refer to something sold, bought, watched, heard, etc.
In contrast, many (if not all) of the ontologies that have been built to date, or are claimed to have been built, focus on the use of language in the general sense, not on commercial patterns on the Web. Therefore, their usefulness in tackling Web search queries is greatly compromised, sometimes to the point of absolute failure. An ontology may be able to disambiguate a dozen different senses of the word “kill”, but that is of little help if the last 100,000 queries in the search logs did not include a single occurrence of the word “kill”. Like drowning in 2-inch-deep water, such ontologies cannot use their disambiguation skills on nearly 80% of the queries, because the queries include nothing but onomasticons and/or are too short (under-articulated).
The Sequence Approach
The second innovation in the CO is the use of sequences instead of single words. A single word, like “kill”, is the most ambiguous state of information and is hardly ever used in human communication without a strong underlying or implied context. As a result, building a natural language processing (NLP) system that takes the single word as its unit of computation is an invitation to disaster.
In contrast, word sequences (2 or more words) are inherently safe and highly descriptive. Take “road kill”, for example. This sequence describes the corpse of an animal killed on the road by a passing vehicle. If a language processing system takes sequences as its unit of computation, 99% of the ambiguity problem vanishes. There is no need to process the words “kill” and “road” separately, trace their senses, and locate their convergence to identify the meaning of “road kill” if you can just take the sequence “road kill” itself as your unit of computation for mapping. This is depicted below:
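[Figure: sense traces in a conventional ontology compared with direct mapping in the sequence approach]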
Note the number of traces required in the conventional ontology approach compared to the sequence approach. The sequence approach requires a lot of data storage space (which is dirt cheap), whereas the conventional ontology approach requires a lot of CPU for a simple mapping task (which is expensive). But the bad news does not stop there. The trace routes in a conventional ontology require manual work (impossible to automate), whereas a sequence-based ontology can easily be built via automation.
I realize only a handful of people will understand the second point above. Nevertheless, the scalability and performance of the end product will speak for itself when we put the testing platform on-line.
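To make the trade-off a little more concrete, here is a minimal Python sketch of the two styles of lookup. It is only an illustration, not the CO itself: the sense lists, the RELATED pairs, the SEQUENCES table, and both function names are hypothetical toy data invented for this example. The word-by-word version counts every combination of senses it has to check, while the sequence version is a single table lookup.

```python
from itertools import product

# Hypothetical toy data, invented for illustration only.
WORD_SENSES = {
    "road": {"street", "route", "journey"},
    "kill": {"cause death", "animal carcass", "terminate process", "sports point"},
}

# Sense combinations that a conventional ontology would have to link by hand.
RELATED = {("street", "animal carcass")}

# Sequence-based table: the whole phrase is the key.
SEQUENCES = {"road kill": "animal killed on a road by a vehicle"}


def resolve_by_words(query):
    """Trace every combination of word senses and keep the ones that converge."""
    sense_sets = [sorted(WORD_SENSES.get(word, ())) for word in query.split()]
    traces = 0
    matches = []
    for combo in product(*sense_sets):
        traces += 1
        if combo in RELATED:
            matches.append(combo)
    return matches, traces


def resolve_by_sequence(query):
    """Look the whole sequence up directly; this always costs one lookup."""
    return SEQUENCES.get(query), 1


if __name__ == "__main__":
    print(resolve_by_words("road kill"))
    print(resolve_by_sequence("road kill"))
```

With this toy data the word-by-word route checks 12 sense combinations (3 senses of “road” times 4 senses of “kill”) to find one match, whereas the sequence route needs one lookup at the price of storing the phrase.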
Usage of the Commercial Ontology
The immediate use of the CO is for search queries, or document characterizations, that are not tied to any advertising in conventional systems. This unrecognized domain of search queries and characterizations means lost revenue. hakia’s CO is designed to fill this gap. For example, if the search query or page characterization is “beat generation”, the CO can map it to “literature” on the fly. As a result, systems using the CO will have a much deeper understanding of the incoming terms and will thus be able to recognize the underlying intent beyond the face value of the words. The same capability can be used in a number of places other than advertising, with the same effect.
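As a rough sketch of what such a mapping could look like, the Python snippet below scans a query for the longest known word sequence and returns its category. The function name, the table, and the 3-word limit are assumptions made for this example, and only the “beat generation” to “literature” pair comes from the text above; the other entries are hypothetical placeholders. It is not hakia’s implementation.

```python
# Hypothetical lookup table. Only "beat generation" -> "literature" is taken
# from the post; the other entries are invented for illustration.
SEQUENCE_TO_CATEGORY = {
    "beat generation": "literature",
    "foot massage": "health and wellness",
    "last emperor": "movies",
}

MAX_SEQUENCE_LENGTH = 3  # assumed upper bound on sequence length, in words


def map_query(query):
    """Return (sequence, category) pairs found in the query, longest match first."""
    words = query.lower().split()
    found = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest candidate first; sequences have at least 2 words.
        for n in range(min(MAX_SEQUENCE_LENGTH, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in SEQUENCE_TO_CATEGORY:
                found.append((candidate, SEQUENCE_TO_CATEGORY[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return found


if __name__ == "__main__":
    print(map_query("beat generation"))
    print(map_query("the benefits of foot massage"))
```

A query with no known sequence simply returns an empty list, which a calling system could treat as “no commercial category recognized”.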
Stay tuned for the release of the first version of our commercial ontology.

July 27th, 2009 at 11:47 pm
This is very interesting… I’ve not heard of CO before… how would this relate to Google LSI and other related algorithms?
It seems that CO is a specialty that relates to commercial intent, while LSI is more general purpose?
I’m both anxious and curious to see all the search engine changes that will arrive in the coming 10-20 years…