Everything You Always Wanted to Know About Semantic Search, But Were Afraid to Ask (in SemTech Conferences)

June 24th, 2009 by Dr. Riza C Berkan, CEO

In the wake of the SemTech09 conference, I thought this title would do justice to those mischievous readers who happened to have the good fortune to stumble across this blog posting. The conference was great, neatly organized, and carefully secluded in San Jose, California. One of the highlights was the Semantic Search Keynote Panel with all the players on stage (Ask, Bing, Google, hakia, TrueKnowledge and Yahoo!) as seen in the picture below.

semtech09-panel

Bear in mind that semantic technology can be as heavy and stifling to “any” audience as stem-cell research can be to high-school students. Thanks to Carla Thompson from Guidewire, who did a terrific job coming up with discussion topics and moderating the panel, everyone survived the ordeal without any sign of dozing.

Despite the positive outcome, some responses from the panelists made me wonder if we should go back to the basic question of “What is semantic search?” Or, better to discuss: what is NOT semantic search? Here is my list:

Structured data. Folks, structured data is NOT semantic technology. A database that can pull out a list of beer brands, their manufacturers, and their contact information, given the query “social drinking”, has nothing to do with semantics. I say this because some people seemed to be under the illusion that there must be some kind of semantic technology at work if a search engine shows such structured data in the SERP. It is a trick as old as the ancient Egyptians, who used beads on strings to organize harvesting information. Organized information is not semantics.

Morphology. If a search engine is robust (brings the same results) to a query “top ten” versus “top 10” by recognizing “ten=10”, it would be a stretch of the imagination to call it semantic. Anyone can come up with such a replacement list without a drop of linguistic knowledge. Similarly, distinguishing the name Fisher from the noun fisher by detecting the capitalization of the first letter does not go beyond the application of simple linguistic rules. These capabilities are not semantic search capabilities.
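
To show how little is involved, here is a minimal Python sketch of such a replacement list and of the capitalization rule. The table entries and the function names (normalize_query, looks_like_name) are made up for illustration; this is not how any particular search engine does it.

```python
import re

# Hypothetical replacement list: number words mapped to digits.
NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}


def normalize_query(query: str) -> str:
    """Lowercase the query and swap number words for digits, token by token."""
    tokens = re.findall(r"\w+", query.lower())
    return " ".join(NUMBER_WORDS.get(token, token) for token in tokens)


def looks_like_name(token: str) -> bool:
    """Capitalization rule: a leading capital letter is taken to mean a name."""
    return token[:1].isupper()
```

With this in place, “top ten” and “Top 10” normalize to the same string, and “Fisher” is flagged as a name while “fisher” is not. Note that the capitalization rule would also flag any ordinary word at the start of a sentence, which is exactly the kind of limitation a rule without meaning runs into.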

Syntax. It is true that a certain level of semantic information can be salvaged from syntax. Unfortunately, if syntax were enough to detect the meaning of text, then an 8-year-old kid who had developed a perfect reading skill (syntactically parsing strings of letters and words in English) would be expected to understand the meaning of Shakespeare’s works. The difference between reading and understanding is the difference between syntax and semantics. The former requires the skill to parse things out, whereas the latter requires a vast amount of associative knowledge.

Statistics. An infinite number of monkeys with keyboards would eventually type the complete text of the Declaration of Independence. This is statistically correct. However, if a search engine is expected to become semantically apt using statistical algorithms, one has to wait until the monkeys finish their job. There is no place for statistics in semantics. For example, let’s take this sentence: “Polar bears don’t eat alligator eggs before dawn.” I am sure you have never seen this combination of words before in your life. But the fact that you can understand what it means is simple evidence that the semantic brain does not need statistical sampling. Meaning does not emerge from statistics. It emerges from associative knowledge.

Scalability. Scalability is the narrow bridge between science and technology. What you can carry from the science side to the technology side over this bridge determines the level of capabilities in the real world. The science of semantics is huge, stemming from the basics of philosophy. But Web search is a highly particular problem with stringent constraints (the narrow bridge). Designing semantic algorithms to drive a Web search engine is like walking on eggshells and requires a completely new approach. Therefore, a semantic algorithm can be very sophisticated without being a semantic search algorithm suitable for the Web.

The five issues I addressed above explain what is NOT semantic search and should guide interested readers in questioning emerging technologies at SemTech10. Structured data, morphology, syntax, statistics, and scalability are the key questions to discuss. Obviously, no one would be afraid to ask these questions, unlike what the title suggests, but if you understood the title, it was your semantic brain in action. That was my last example of “what is semantics” in this article.

Inspired by hakia, Bing introduces categorized search

June 2nd, 2009 by Melek Pulatkonak, COO

Bing, the new search engine from Microsoft, just went live and in doing so introduced a similar version of hakia’s categorized search. At its launch in 2006, hakia became the first search engine to provide categorized aspects of search queries via hakia Galleries.

hakia Galleries received industry accolades after their formal introduction in 2007. Our goal has always been to take search beyond 10 blue links. It was then no surprise when Microsoft invited us to show them the inner workings of the hakia Galleries in July 2008, shortly after their acquisition of Powerset. But it was a huge surprise to recently find out that Microsoft introduced categorized search in Bing. Today we checked out the Bing preview and compared Bing’s categorized search feature to its inspiration, hakia Galleries.

hakia Galleries provide categorized aspects of search queries. For example, if you are searching for Obama, you can find information about his official site, headline news, images, biography, speeches, and more (see image below). Powered by semantic search, hakia Galleries provide 17 aspects of this query. We save the user time by answering 17 Obama-related questions in one search. Compare the hakia Obama gallery with the same search at Bing.com (Bing provides only 7 aspects of this search query).

hakiaobama
bingobama1

Let’s look at another example. Search for lung cancer at hakia and Bing. hakia provides the searcher links for the following aspects of this query: Basic Information and FAQ, Image Search, Headline News, Symptoms and Diagnostics, Treatment, Procedures, and Therapy, News, Clinical Trials, Healthcare Facilities and Finding a Physician, Alternative Therapy, For Kids, Research and Statistics, Organizations, Message Boards, and Images. Compare that with Bing’s aspects: articles, symptoms, treatment, prognosis, stages, clinical trials, and images. Look familiar?

hakialc
binglc1

As Danny Sullivan put it aptly in his Bing review, “Probably the most significant change is that Bing now organizes search results into categories (gives Obama example)…The concept of grouping results also isn’t new. Long known as clustering, you can see it in operation at hakia (see Obama there) or Clusty (again, see Obama there).”

At hakia we could not dream of a marketing budget of $80-100 million. But hey, if you are out there to try Bing as an alternative search engine to Google, give the original categorized search a try at hakia.com (one of Bing’s inspirations!). You can surf the hakia Galleries here: http://gallery.hakia.com/ or try your search at hakia.com when you bing and ding.

A New Contextual Advertising Technology from hakia: CONTEXA, launched at ReadWriteWeb

May 19th, 2009 by Kartal Guner, Chief Architect

We are happy to announce that we have launched CONTEXA, the new contextual advertising module of our semantic advertising system. ReadWriteWeb (RWW), one of the world’s top 20 most popular blogs according to Technorati, is our first partner.

CONTEXA provides page-level contextual analysis on-the-fly and outputs keywords that represent the meaning of the page along with their meaning score. CONTEXA is offered as a service and can be integrated into any ad system. RWW has integrated CONTEXA where our system matches the contextual representation of a blog page with sponsors’ requirements on-the-fly to provide relevant ads to RWW readers for a richer experience. The red box in the image below shows this step.

rww

We believe that more relevant contextual ads will bring the return of contextual advertising closer to paid-search levels, with a ripple effect of increased CTR, conversion rates, and revenue. CONTEXA is powered by hakia’s semantic core technology. To see how CONTEXA works, you can visit our CONTEXA page.

In the fall, we shared with our readers a comparison demo of hakia’s contextual capabilities with those of AdSense and Yahoo. We did not have a chance to do a comparison with Microsoft’s PubCenter. As we move along with ReadWriteWeb’s implementation of CONTEXA, we will report on lessons learned and milestones marked.

We are excited to keep the wheels of innovation turning at hakia, as our industry has plenty of room for improvement. Today, Web users are overwhelmed by the quantity and suffer from the quality of display ads, and they quickly learn to ignore a good portion of the Web pages they visit. In the long run, the industry’s focus will have to shift to increasing ad quality and limiting the supply to increase value. The path to this promise goes through enhancements to both contextual and behavioral ad targeting technologies. We are happy to partner with ReadWriteWeb, a kindred-spirited innovator, for the beginning of a journey to provide more relevant contextual ads.

To learn more about CONTEXA, please contact bdev at hakia.com. We are more than happy to set you up with a custom demo.

Once again, hakia is a Webware 100 finalist – Please Vote!

April 2nd, 2009 by hakia Team

hakia has experienced amazing momentum over the past year, and we are proud to announce that we are once again a finalist for the prestigious CNET Webware 100 awards! The Webware 100 Awards recognize the 100 best Web 2.0 applications, chosen by Webware readers and Internet users across the globe.

Last year, over 1.9 million votes were cast to select the winners, including hakia in the “Search and Reference” category. To make that a reality once again, please vote for us here!

Thanks to our community of users for supporting the search engine and recognizing the importance of semantic technology for the future of the Internet. We look forward to more progress to come as we near completion of development.

Automated Categorization of Search Results, a New Era?

March 23rd, 2009 by hakia Team

Since the hakia Galleries have been on-line, we have received nothing but praise. Our proprietary approach to “Aspect Categorization” shines with examples in topics ranging from music to health. We currently cover more than a million popular queries.

hakia’s fully automated gallery production, in which search results are categorized according to the query, can be seen at the following demo link, which covers 1425 different car brands and models.

Car Brands & Models.

This is part of our ongoing effort to spread this capability to all search queries, effectively creating a new organization of the content on the entire Web, in a way as distinct as how Wikipedia invented its own style.

Microsoft’s recent news about KUMO and its screen-shots leave no doubt that some people are already convinced this is the way to progress in search.

Aspect categorization is different from what some search engines are already doing. For example, dividing the SERP into Web Results, Videos, News, Images, etc., is not aspect categorization. However, when the categories are related to the query, such as Obama’s Speeches and Quotes, Obama’s Fans, etc. (for the query Obama), then it is aspect categorization.

Aspect categorization in search is a tough business. It requires careful off-line analysis to determine how the categories are going to be decided algorithmically, which resources will be identified for crawling, and how the results will be detected to fit in.

The effectiveness of this approach in the broad search space is yet to be seen, and the users will have the last word as always. The tech bloggers and authors will be able to make their own judgment and recognize the limitations and imitations. In light of our patent application in progress, we are also anxious to see where all this leads. Some exciting times ahead. Until then, happy searching at hakia.

Books, Bytes and Trees: What Do You Know?

February 11th, 2009 by hakia Team

We put together a fun quiz and invite you to stop thinking about the economy/stimulus package/your job and take a moment to ponder about the size of information overload/resources/pollution in the Internet age. We think about searching it better- all the time!

Here is a teaser, the first question.

hakiaquiz

Take the hakia Quiz now at http://company.hakia.com/quiz/quiz1.html. Enjoy!

hakia ScoopBar, Now Highlights Pages Found by Other Search Engines.

February 5th, 2009 by hakia Team

A new version of hakia ScoopBar (for both IE and Firefox browsers) has just been released. This version highlights search results in the opened Web pages that are found by hakia, as well as Google, Yahoo, Live, or any other search engine.

An example is shown below for the query “roman invasion of jerusalem” using Google.

gog1

With the hakia ScoopBar installed and the Highlight button activated (as shown above), you can open long documents and the search result will be located on the page by automatic scrolling and highlighting (as shown below).

gog2

Auto-highlighting is becoming increasingly important for tackling the 2nd search problem, especially in longer documents. The hakia team is committed to improving this functionality for Web searchers.

Note that hakia ScoopBar does not monitor user behavior, does not track Web traffic, and comes with an uninstall option. Give it a try and let us know your opinion.

Making Quality the Key to Web Searches

January 13th, 2009 by hakia Team

We are happy to share a commentary by our CEO, Dr. Riza Berkan, for the Project Syndicate that was published in the Japan Times:

In the not-so-distant future, students will be able to graduate from high school without ever touching a book. Twenty years ago, they could graduate from high school without ever using a computer. In only a few decades, computer technology and the Internet have transformed the core principles of information, knowledge, and education.

To read the full article click here.

Project Syndicate is an international association of quality newspapers devoted to bringing distinguished voices from across the world to local audiences everywhere, strengthening the independence and upgrading their journalistic, editorial, and business capacities.

Did Someone Just Expose Semantic Data?

January 12th, 2009 by Dr. Riza C Berkan, CEO

This is a response to Marshall Kirkpatrick’s recent post Did Google Just Expose Semantic Data in Search Results?.

There have been many trivializing depictions of semantic search and semantic Web in the blogosphere, so much so that I might have developed an allergic reaction reading them. However, Marshall is doing the right thing by provoking us to define this space better.

First of all, what is “semantic data”? Judging from the examples described, I think what this means is “syntactic extraction”. Extraction by fitting syntactic patterns, sorry to disappoint some of you folks, is really not semantic analysis. The extraction problem has been around for many years, and is being implemented all over the market in enterprise (and government) applications.

Take a word pattern “what is the capital of –” or “what is the capital city of –”. Then, obtain a two-column list from the Web of the capital cities around the world. After 12 minutes 34 seconds of programming, you will have an extraction algorithm (extraction from the query) just like the one Google uses in these examples… This is not semantic analysis.
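
The recipe above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the three-row CAPITALS table stands in for the two-column list collected from the Web, and the function name answer_capital_query is made up. It says nothing about how Google actually implements its feature.

```python
import re
from typing import Optional

# Hypothetical two-column list (country, capital); a real one would be
# collected from the Web and cover every country.
CAPITALS = {
    "france": "Paris",
    "japan": "Tokyo",
    "turkey": "Ankara",
}

# The two word patterns from the text, folded into one regular expression.
CAPITAL_PATTERN = re.compile(
    r"^what is the capital(?: city)? of (?P<country>.+?)\??$",
    re.IGNORECASE,
)


def answer_capital_query(query: str) -> Optional[str]:
    """Return the capital if the query fits the pattern, otherwise None."""
    match = CAPITAL_PATTERN.match(query.strip())
    if match is None:
        return None  # the query does not fit the syntactic pattern
    country = match.group("country").strip().lower()
    return CAPITALS.get(country)  # None if the country is not in the list
```

The query “What is the capital of France?” is answered from the table, while a rephrasing with the same meaning, such as “Which city is the capital of France?”, falls through because it does not fit the pattern. Nothing in this code models what a capital or a country is.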

One step further, you can sit down and define patterns until the cows come home, and end up with a large library of extraction algorithms. You might scan through Wikipedia to collect data (if you don’t care about proper authorship and credibility). Then you will have something useful, no doubt about it. However, these are not to be considered semantic analyses.

Bruno Haid expressed his concern by using the terminology “structured versus unstructured platform” for the target of extraction. That is still not enough differentiation between syntax versus meaning in my book. For anything to be considered “semantic” there has to be a model of understanding, involving concepts and associations.

I recommend an old article written by George A. Miller on the ambiguity of words, which should inspire a thought as to why a syntax-only approach cannot replace meaning. We had posted a fun example here following Bill Gates’ vision. An example of semantic parsing was also posted here previously.

The most important question is how to implement semantic analysis in a search engine environment. The examples in Marshall’s post do not come close to any kind of semantic analysis beyond simple extraction operation. Google has not shown any clues to make us think of an actual semantic back-end yet.

Search Box: Keep Your Curious Visitors on Site

January 6th, 2009 by hakia Team

With the start of 2009, we have just released a new and improved version of hakia Search Box. To see how it works, go to Search Box Page.

One immediate distinguishing feature of hakia Search Box is its flexibility to search in multiple domains as shown below.

searchbox

The second distinguishing feature is its sentence highlighting and semantic precision (especially with complex, long-tail, and unusual queries) as shown below. Note the uninterrupted text snippets (no ellipses) for Pubmed and health searches.

searchbox3

There are several ASP and PHP examples on the page with design options as outlined below:

- Web Plus Search (multiple domains as shown above)
- Site search (pick a site to search only its content)
- Pubmed search (search results from 10 million pubmed articles)
- Health search (search results from credible Web sources on health)

It is free up to 30,000 searches per day (which is the highest number offered to date).

Why do you need a good search box on your site? Well, you don’t want those curious visitors to leave your site and go to a search engine. With a good search box, you will keep them on your Web property.

If you already have a search box on your Web site and you are not sure what to do, you can add hakia’s search box as a semantic search option.

Give it a try and let us know.