Archive for February, 2008

hakia.com is a Webware 100 Finalist

February 26th, 2008 by hakia Team

We started the week with a surprise email from Rafe Needleman: hakia.com has been selected as a finalist in the 2008 Webware 100 Awards in the Search & Reference category.

Apparently, the editors of CNET Webware.com selected 30 products in each of ten categories – from nearly 5,000 entries – for voting. Thank you for nominating us!

Now voters will get busy and support their favorite Web 2.0 apps.
To vote for hakia, visit Webware 100.

hakia versus Google: Anti-gravity machine versus Slot machine?

February 22nd, 2008 by Dr. Riza C Berkan, CEO

I have never met Peter Norvig, Google’s director of research, although we both appear in an article that was published yesterday in Forbes. Our opposing views came together there, which prompted me to write this blog entry with a somewhat humorous sentiment.

First, I raise my hat to Peter Norvig for his analogy: semantic search = anti-gravity machine. According to the article, he said that semantic technology experiments he conducted in 1978 yielded results like a dancing bear. He seems to have subscribed to the idea that the near future of search will not go beyond being a slot machine, as Chris Sherman describes today’s expectation. Having stated his views about the feasibility of semantic search, Dr. Norvig adds that Google is actually working on it behind closed doors (just in case!). Is it only me detecting a lack of guidance and inspiration? Most likely, there is more to it than what is publicized.

We respect and admire Google (how can you not?) for its simplicity and performance. However, this is 2008, and bear the-semantic-search is not only dancing tango today, it is about to get into ice-dancing pretty soon. When it debuts with full force, anti-gravity machines will start to lift old expectations along with some people’s hats (if they are not glued to their heads).

What is puzzling to me is the persistent avoidance by many established search technologists of the question “what is beyond statistics?” What do you do when you have a good quality Web document that is not statistically sampled? This question will get more serious when we consider dynamic Web content (something increasing every day) where statistical sampling cannot mature before the content is outdated. In short, the long tail. We have not heard an answer so far.

Let me put it another way. Do you choose your doctor, lawyer, spouse, religion, financial advisor, retirement plan, political hero, or baseball team by statistics? Or do you have your personal views? For the latter, you need to go beyond the popular view. That is where semantics starts.

For those who want to see glimpses of bear the-semantic-search doing tango moves, I have listed a few queries below in a side-by-side comparison of hakia with Google, Yahoo, and MSN. Sorry that you have to sign up to access these pages and to continue with your own experiments.

What proteins are highly expressed in the lungs?

Who is the best plumber in new york city?

What are the elemental forms of carbon?

What drug treats urinary tract infection?

These queries are only scratching the surface of the long-tail. Even with short and popular queries, you can see some nice moves with semantic search:

asthma
toyota
jamaica

These examples are here to show the signs of what is coming, and should naturally raise the question of why Google cannot handle them, that is, rank the most relevant and transparent result at the top, or offer the most intuitive categorization of simple search terms. As I said, what you are seeing is only the beginning of an ultimately different search experience.

We keep an archive of search comparisons so that we do not fool ourselves when Google’s bad results are mysteriously corrected following public examples of this sort. It happened in the past, probably by accident! If it happens again, we will display the corrections here.

The question of semantic search is not an IF question, it is a WHEN question.

If the alarm bells are not ringing for “search” itself with the hope of a “longer WHEN”, they should be ringing for advertising. Search and on-line advertising are twin sisters, and the latter does not take jokes well.

hakia’s Semantic OntoParser

February 17th, 2008 by Dr. Christian Hempelmann, Chief Scientific Officer

Here, we explain how hakia’s semantic OntoParser takes a sentence, processes it, and produces a text-meaning representation. This is the essence of making computers understand natural languages by means of ontological semantics resources and parsing algorithms.

Take the simple sentence “The outlaws ran cocaine into the United States.” We (the human brain) can identify the meaning of this sentence easily: Humans who habitually commit illegal acts clandestinely transported a psychoactive drug into a country called the United States. We also infer all kinds of other things from our knowledge of the world: The cocaine probably came from South America, will be sold illegally for profit and consumed by people who will show certain changes in their behavior and emotions (probably pleasant for them, usually unpleasant for those around them) after consuming it, typically by snorting it up their noses, etc.

Let’s see how close to this understanding the computer can get with ontological semantics. First, OntoParser produces all potential senses of the words in the sentence and breaks the sentence up into clauses based on central events that are identified among the senses. Screenshots from the OntoParser demo are shown below.

[Screenshot: onto1a.gif]

“The outlaws” has only one sense, CRIMINAL (note, we use capital letters to indicate that we’re talking about concepts that express a sense, not words in the sentence), but for “run” our system has all of 9 senses from which it must pick, for example RUN, RUN-FOR-OFFICE, or SMUGGLE; “cocaine” has two related senses as a DRUG, and “United States” likewise has two as a COUNTRY. With 1 x 9 x 2 x 2 = 36, this simple sentence has 36 potential meanings at this stage.
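The counting step can be sketched in a few lines of Python. This is only an illustration, not hakia's code: the sense counts come from the paragraph above, FLOW is assumed to be one of the nine "run" senses because it is discussed as an EVENT later in the post, and every name marked as a placeholder is invented.

```python
from itertools import product

# Sense inventory for the example sentence. The counts (1, 9, 2, 2) come from
# the post; names marked "placeholder" are invented for illustration only.
senses = {
    "outlaws": ["CRIMINAL"],
    "ran": ["RUN", "RUN-FOR-OFFICE", "SMUGGLE", "FLOW"]
           + [f"RUN-SENSE-{i}" for i in range(5, 10)],        # placeholders
    "cocaine": ["DRUG-SENSE-1", "DRUG-SENSE-2"],              # placeholders
    "United States": ["COUNTRY-SENSE-1", "COUNTRY-SENSE-2"],  # placeholders
}

# One candidate reading = one sense chosen per word.
readings = list(product(*senses.values()))
print(len(readings))  # 36
```

Each tuple in `readings` is one candidate meaning that the later steps must either exclude or score.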

But not all of these combinations are possible: CRIMINALs can’t FLOW a DRUG, for example. Such combinations are excluded by matching properties of the CONCEPTs in our world model, the ontology. FLOW, for example, allows for no agent, only a theme, and that theme must be a liquid. Neither CRIMINAL nor DRUG is a liquid, and only one of them could be fit into that EVENT anyway. The parser sets the 9 possible EVENTs and tries to fill all the other OBJECT senses in the sentence as participants in the EVENT.

The event SMUGGLE allows for a theme that must be a WEAPON, an ILLEGAL-DRUG, or an IMMIGRANT.

[Screenshot: onto4.gif]
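As a rough sketch of how such restrictions might be checked, the snippet below encodes the FLOW and SMUGGLE constraints described above against a toy is-a hierarchy. The hierarchy, the HUMAN agent restriction on SMUGGLE, and the function names are assumptions made for illustration; they are not taken from hakia's ontology.

```python
# child -> parent links of a toy ontology (hypothetical, not hakia's)
PARENT = {"CRIMINAL": "HUMAN", "ILLEGAL-DRUG": "DRUG", "WATER": "LIQUID"}

def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or descends from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False

# Case-role restrictions per EVENT. A missing role means the EVENT does not
# take that role at all (FLOW has no agent). The agent restriction on
# SMUGGLE is an assumption; the theme restrictions come from the post.
EVENTS = {
    "FLOW": {"theme": ["LIQUID"]},
    "SMUGGLE": {"agent": ["HUMAN"],
                "theme": ["WEAPON", "ILLEGAL-DRUG", "IMMIGRANT"]},
}

def allowed(event, role, concept):
    """Can `concept` fill `role` of `event` under the restrictions above?"""
    return any(is_a(concept, c) for c in EVENTS[event].get(role, []))

print(allowed("FLOW", "agent", "CRIMINAL"))         # False
print(allowed("FLOW", "theme", "ILLEGAL-DRUG"))     # False
print(allowed("SMUGGLE", "agent", "CRIMINAL"))      # True
print(allowed("SMUGGLE", "theme", "ILLEGAL-DRUG"))  # True
```

Readings whose participants fail these checks can be dropped before any scoring takes place.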

The parser fills all EVENTs with the possible PARTICIPANTs (case roles) from the sentence that it has chosen in the previous step. Then it weights the possible EVENTs and all combinations of their PARTICIPANTs, in terms of how well the PARTICIPANTs fit into the EVENTs.

For most EVENTs, CRIMINAL can fill the agent slot, but the other CONCEPTs fit nowhere; for fewer EVENTs, UNITED-STATES can be fit into theme or location, gaining them a higher score. Even fewer EVENTs can accommodate all three other CONCEPTs (some of them actually wrong). But SMUGGLE wins this race because ILLEGAL-DRUG fits closest to the theme it can take. So finally the parser outputs the text-meaning representations from the top scoring down to the lowest scoring.

[Screenshot: onto3a.gif]
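The weighting step could look something like the following minimal sketch. The post does not give the scoring formula, so the one used here (each filled role earns 1 / (1 + number of is-a steps between the filler and the required concept)) is an assumption, as are the toy hierarchy, the greedy role filling, and the CARRY frame, which is invented purely as a competitor for SMUGGLE.

```python
# child -> parent links of a toy ontology (hypothetical, not hakia's)
PARENT = {
    "CRIMINAL": "HUMAN", "HUMAN": "OBJECT",
    "ILLEGAL-DRUG": "DRUG", "DRUG": "OBJECT",
    "UNITED-STATES": "COUNTRY", "COUNTRY": "PLACE",
}

def distance(concept, ancestor):
    """Number of is-a steps from concept up to ancestor, or None if unrelated."""
    steps = 0
    while concept is not None:
        if concept == ancestor:
            return steps
        concept = PARENT.get(concept)
        steps += 1
    return None

# Role restrictions per EVENT. CARRY is a made-up frame for comparison.
FRAMES = {
    "RUN": {"agent": "HUMAN"},
    "CARRY": {"agent": "HUMAN", "theme": "OBJECT", "location": "PLACE"},
    "SMUGGLE": {"agent": "HUMAN", "theme": "ILLEGAL-DRUG", "location": "PLACE"},
}

def score(event, concepts):
    """Greedily fill each role with the closest-fitting unused concept."""
    total, remaining = 0.0, list(concepts)
    for required in FRAMES[event].values():
        fits = [(distance(c, required), c) for c in remaining]
        fits = [(d, c) for d, c in fits if d is not None]
        if fits:
            d, best = min(fits)
            total += 1.0 / (1 + d)  # a closer fit earns a higher weight
            remaining.remove(best)
    return total

concepts = ["CRIMINAL", "ILLEGAL-DRUG", "UNITED-STATES"]
ranking = sorted(FRAMES, key=lambda e: score(e, concepts), reverse=True)
print(ranking)  # ['SMUGGLE', 'CARRY', 'RUN']
```

In this toy setup SMUGGLE ranks first because ILLEGAL-DRUG matches its theme restriction exactly, while CARRY only accepts it as a distant OBJECT and RUN cannot place it at all.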

This capability is the essence of semantic search where the concepts in a given query are matched to the concepts in Web pages. The range of applications that can use this technology includes summarization, categorization, classification, abstraction, machine translation, data mining, and more.

Information Pollution, the Murder She Wrote

February 14th, 2008 by Dr. Riza C Berkan, CEO

Readers of our Blog may remember my previous post about “information pollution.” A recent Google Blog post identifies one special form of it: “… we have investigated billions of URLs and found more than three million unique URLs on over 180,000 web sites automatically installing malware.”

It is now evident that the decade-old strategy of “cover everything under the sun” is no longer a value-adding proposition to Web search due to the increasing rate of information pollution. Unfortunately, search engines that rely on statistical algorithms – but nothing else – will continue their uphill battle against information pollution, and against tactics that can easily generate baits for statistical algorithms.

About a decade or so ago, the Web was more of an unknown, and statistical methods needed to identify popular pages (in terms of links) to approximate their legitimacy and relevancy. But now, the credibility of Web sites has become highly transparent. The authority map of the WWW is as tangible as the geographic map of the world. How much popularity is needed now?

The title I picked, inspired by the famous TV show, may sound too harsh, but I wanted to make one point clear. Google’s remarkable success by means of popularity algorithms was also the beginning of the SPAM industry. Commercial success sparked more information pollution, and now we are reaching the point of very low signal-to-noise ratios. Cleaning up the pollution is Google’s main concern, wisely so if they want to continue on the same path.

But for some of us, it is obvious that the popularity approach is no longer enough, and some fundamental changes are needed in search philosophy. Semantic search technology is one potential solution, and we are working on it day and night at hakia.

With semantics, information pollution can be decreased back to insignificant levels, much like what clean-air technologies do for energy production. In simple terms, the search algorithm will no longer take link referrals, or any other author/user generated statistics, as the only means of rating content. Semantic search algorithms will analyze the content for what it really means.

One thing is for sure: if the spammers do not disappear, they will have to operate on a new level of ingenuity, perhaps as sophisticated as a poet or novelist, to generate content that can fool a semantic algorithm. That also depends on how good the semantic technology will be.

Beta Update at hakia.com

February 12th, 2008 by hakia Team

We have just updated the hakia Beta site. We are one step closer to our final target.

Did you notice the UI change and our initial pick of databases? An example speaks louder than long explanations. Search for “what causes dizziness.” You will notice that we picked ehow.com, Wikipedia and News (when appropriate) to bring you quality search results at the top of the search result page. There will be more to come. Stay tuned.

Nested “deep” search: When you click on the “more +” button next to a search result, one from Wikipedia for example, hakia will search the query, “what causes dizziness,” within Wikipedia to bring you the most relevant results.

[Screenshot: deep.gif]

Join the hakia Club! You can now join the h-Club and keep abreast of the developments under the hood of our engine. As an h-Club member, you can compare hakia to your favorite search engine, submit your Web page for QDEXing and see what is in the development pipeline.

We will post individual blog entries about our new products and features. hakia Club will be next. Did you join?