Archive for June, 2007

Credibility Versus Popularity

June 29th, 2007 by Dr. Riza C Berkan, CEO

When we state that hakia is currently analyzing Web pages to rank them by semantic methods and credibility criteria, many people ask what the “credibility” of a Web page is, and how it can be better than “popularity”.

In a nutshell, I can tell you right off the bat that credibility is the real thing, and popularity is an approximation of (a cheap imitation of) credibility. Let’s walk through some cases to make this point clearer.

Case-1: Domain name
Let’s take the query “madonna”. Obviously the most credible site is Madonna’s official Web site, madonna.com. Thanks to the fiercely competitive market for Web names, only Madonna can afford this name. Thus, the very first criterion for credibility is the domain and how well that domain is controlled. For example, domains like .mil and .gov are tightly controlled and reserved for official announcements, so junk content is very unlikely to appear there.

Case-2: Rating
In this day and age, lists of the highest-quality editorial pages are available. One can use Hitwise or Alexa data to go through the highly rated sites, and easily edit this list to assign each site a rating. CNN.com, for example, is obviously a credible source. The list can be extended to include all company names, with their official pages and all the products they offer. So, if the query is about XBox, the system will know which sites are credible. Popularity algorithms were devised 10 years ago, at a time when such information was not readily available. But today, computing popularity by means of link referrals is like reinventing the wheel.

Case-3: Content
The page content can be analyzed for proper language, layout, and links. These types of analyses are very common today, but semantic analysis is necessary to assess how well the page content represents a given query. Without this crucial element, the credibility assessment is likely to fall short.

Using the three measurements described above, the credibility of a Web page for a given query can be easily assessed. What this method will miss are the “out-of-the-list” items, which are obviously not popular to start with. However, list-based coverage has become the easiest practice today given the current state of data availability, hardware capacity, and connectivity.
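To make the idea concrete, here is a minimal sketch of how the three measurements might be combined into a single score. This is not hakia's implementation: the function names, the weights, the tiny rating list, and the word-overlap stand-in for semantic analysis are all assumptions made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical data, invented for this sketch only.
CONTROLLED_TLDS = {"gov", "mil"}                           # Case-1: tightly controlled domains
EDITORIAL_RATINGS = {"cnn.com": 0.9, "madonna.com": 0.8}   # Case-2: hand-edited rating list


def _host(url: str) -> str:
    """Return the lowercased host name without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_score(url: str, query: str) -> float:
    """Case-1: 1.0 for a controlled TLD or a domain name equal to the query."""
    labels = _host(url).split(".")
    if labels[-1] in CONTROLLED_TLDS:
        return 1.0
    if len(labels) >= 2 and labels[-2] == query.lower().replace(" ", ""):
        return 1.0
    return 0.0


def rating_score(url: str) -> float:
    """Case-2: look the site up in the rating list; unlisted sites score 0."""
    return EDITORIAL_RATINGS.get(_host(url), 0.0)


def content_score(text: str, query: str) -> float:
    """Case-3: crude placeholder for semantic analysis (query-term overlap)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    if not terms:
        return 0.0
    return sum(term in words for term in terms) / len(terms)


def credibility(url: str, text: str, query: str,
                weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted sum of the three measurements; the weights are arbitrary."""
    w_domain, w_rating, w_content = weights
    return (w_domain * domain_score(url, query)
            + w_rating * rating_score(url)
            + w_content * content_score(text, query))
```

Note how an “out-of-the-list” page can still earn a score through its content alone, but it cannot catch up with a listed site on the first two measurements.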

Conclusion: The decade-old, exotic popularity methods for ranking Web pages are increasingly becoming obsolete. Next-generation search systems will definitely not need these approaches any longer, thanks to improvements in content analysis and the availability of credibility measurements. By the same token, manipulating search results by artificial means will become much more difficult.

When hakia’s content analysis of the Web pages is completed, these advantages will become highly visible to the end user. Until then, monitor the progress at our beta site, hakia.com.

A Soap Opera: Privacy & Web Search

June 19th, 2007 by Rob Wyse, Chief Communication Officer

One of the extended benefits of semantic search technology is its independence from privacy-sensitive information. For those wondering what this is all about, we have made up a soap opera to illustrate the extent of what can happen.

Let’s suppose that you just bought an expensive car, and you want to make sure no one will steal it. So you go to a search engine X and type “how to steal a car?” just to make sure you cover all the corners.

Search engine X records your IP address, the date of the inquiry, and the query. It also places a cookie on your computer with an expiration time of 35 years.

You look at the search results, and click on links to get more information.

Search engine X actually routes each result link through itself first, so your clicks are traced and recorded along with your cookie ID.

Fine, so what? Now, let’s go one step further. You have an email account with search engine X. When you created that account, you gave your name, address, etc. In the process, search engine X found the cookie it had placed earlier. At this point the database of this search engine includes the following:

Your name
Your email address
Your IP address
The queries you entered
Links you clicked

Even if this information is not neatly organized in one corner of their database, the bits and pieces can be put together quite fast.
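How fast? The sketch below joins three separate logs on the cookie ID. It describes no real search engine's schema: the log layouts, field names, and sample values are all hypothetical, made up for this soap opera.

```python
from collections import defaultdict


def link_by_cookie(query_log, click_log, accounts):
    """Merge three independent record sets into one profile per cookie ID.

    Each argument is a list of dicts; every field name here is hypothetical.
    """
    profiles = defaultdict(lambda: {
        "name": None, "email": None,
        "ips": set(), "queries": [], "clicks": [],
    })
    for row in query_log:        # what you typed, and from which IP address
        profile = profiles[row["cookie"]]
        profile["ips"].add(row["ip"])
        profile["queries"].append(row["query"])
    for row in click_log:        # which results you clicked
        profiles[row["cookie"]]["clicks"].append(row["url"])
    for row in accounts:         # who you said you were at sign-up
        profile = profiles[row["cookie"]]
        profile["name"] = row["name"]
        profile["email"] = row["email"]
    return dict(profiles)


# Made-up sample data.
queries = [{"cookie": "c42", "ip": "203.0.113.7", "query": "how to steal a car?"}]
clicks = [{"cookie": "c42", "url": "http://example.org/car-theft-prevention"}]
signups = [{"cookie": "c42", "name": "Jane Doe", "email": "jane@example.com"}]

profile = link_by_cookie(queries, clicks, signups)["c42"]
```

Three short loops are enough: the cookie ID is the only key needed to tie a name to a query.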

Assume that, some time later, there is a high-profile criminal act involving a stolen car in your neighborhood, and the criminals are known to use the Internet. The law enforcement agency asks Search Engine X (with a court order) for the activity data of everyone within an IP address range that includes your IP address.

Search Engine X gives your data to the law enforcement agency because your IP falls into the range along with others.

Well, well, well. While you are watching TV and enjoying your evening, your name is hanging on a suspect list in the police station. A detective is looking at your query “how to steal a car”, reviewing the links you clicked, and seeing which documents you read. Since you are a nice person with no criminal record, they pass on you. You are totally unaware of the entire thing, of course. You go on with your life as if nothing happened.

The privacy issue on the Web divides people into those who are terrified by the story above and those who don’t care. While we are making up this soap opera based on technical feasibility, our intention is not to pull you in one direction or another. (You can vote on how much you trust your search engine in our latest poll.)

But if you have not yet assessed the possibilities, you may like to consider the fact that DOTS can be connected very easily in this day and age of Web search engines. Not to mention the fact that all this compromise of your privacy may be helping search engine X improve its algorithm and become more profitable.

At hakia, we can guarantee that your information is not used to improve our search algorithms, because our semantic search technology does not need such information to be collected. Accordingly, hakia.com offers a more privacy-friendly search option, if you care about that.

Hey Folks! The Race is Under the Water.

June 7th, 2007 by Kartal Guner, Chief Architect

The picture I put here is an iceberg. The tip of the iceberg is much smaller than the bottom part as everyone knows. This is a commonly used analogy to illustrate the difference between POPULAR and LONG-TAIL queries.

In scientific terms, the difference in magnitude between the POPULAR and LONG-TAIL is actually much larger. Long-tail is more like a black-hole, seemingly infinite, dark, cold, and merciless against popularity algorithms. These are the unpopular, longer than usual, complex, unique, and personal queries. These are the ones that need precision, accuracy, and relevancy.

You know this. We at hakia know this. Silicon Valley knows this. Tech writers know this. VCs know this. Everyone in the search industry is supposed to know this. Very well. Then what’s the problem?

The problem is this: When a new search engine is mentioned today, with the claim of better search experience, why is there no mention of how they will handle the LONG-TAIL? Why is there no mention of how Google, Yahoo, MSN will handle LONG-TAIL in the future with their popularity-based arsenal of algorithms? Then how can a human-labor based search engine (like Wikia or Mahalo) be viewed as progress in search with the obvious impossibility of handling the LONG-TAIL?

Hey Folks, the race is under the water, the bottom of the iceberg.

I will continue to welcome all those new advances in fancy user interfaces, visual tools, and so forth. But, I raise my hat to some other newcomers who are talking about the LONG-TAIL. I want to see new mathematical methods, learning algorithms, nonlinear mapping techniques, and linguistic/semantic approaches coming in.

Most of all, I want to see the end of this clear and present danger of “avoiding the LONG-TAIL talk.” Progress requires open communication about the dark areas, not about the shining tip of the iceberg.

hakia Invites You to Play With Our Semantic Resources

June 1st, 2007 by Dr. Christian Hempelmann, Chief Scientific Officer

When I first heard the term ontological semantics (OntoSem) as a doctoral student of linguistics some ten years ago, I went for cover. My main research interest at the time was the linguistics, more specifically the semantics, of humor. But my mentor suggested that I expand my research interest to cover semantics in general, combine this with my interest in computers, and do research in formal theories of semantics. After all, there aren’t many tenure-track professorships for linguists who apply their research to humor.

My rite of passage into the world of OntoSem was reading the relatively advanced draft of Nirenburg and Raskin’s monograph on the subject. About five times. It struck me as a daring and encompassing approach, and I definitely wanted to help develop it. After working in OntoSem research for many years and applying it in many different domains, including humor, I now face the specific challenge of implementing it for Internet search at hakia.com.

Luckily, hakia users don’t have to sweat through 400 pages to get relevant search results. But to understand the theory and technology now being introduced in the online beta of hakia to help produce these meaning-based results, a look at www.ontologicalsemantics.com will be instructive. On this site, my mentor, Victor Raskin, has assembled links to much of the existing research in OntoSem, including an introduction that will whet your appetite for the research papers and possibly even the book, a prepublication draft of which is available there. You will want to read it about five times, I suggest.

Or you can take a look at our OntoSem resources at the hakia lab. You can do lookups in our constantly improving lexicon and see how the meaning of a word is expressed through ontological concepts. Further developments in OntoSem are taking place behind the scenes at hakia all the time. So I’ll now return to working on the chapters of the next OntoSem book, based on our implementation of it in search. Expect them to appear on www.ontologicalsemantics.com soon. You know how many times you should read them, right?