lucene-java-user mailing list archives

From Alex <azli...@gmail.com>
Subject Document category identification in query
Date Tue, 15 Dec 2009 04:23:43 GMT
Hi,

I am trying to expand user queries to identify the document categories
they imply.
I would like to know the best way to determine which document category
is most relevant to a query.

Let me explain further:
I have created categories that are applied to documents I want to index.
Some example categories are:

Hotel
Restaurant
Fast Food
Chinese Restaurant
Church
Bank
Gas station


I am also trying to create category aliases, so that, for example,
"Chinese food" maps to the same unique category ID as "Chinese restaurant".


The documents I index have 1 primary category and 1...N secondary
categories. For example:

McDonalds will be categorized under Fast Food as its primary category but
also under Restaurant as a secondary category.
The London Pub at the corner of my street will be categorized with Pub as
its primary category and also as Bar, Food and Beverages, Restaurant, and
Fast Food (since they also have takeaway burgers ;).

This all gives me a set of categories that are quite clearly identified, as
well as a set of category aliases even though I'm aware that I can't figure
out all the possible aliases of all my categories. At least I have the most
obvious ones.
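
To make the setup above concrete, here is a minimal sketch of the category
and alias model in plain Python (not Lucene code). The numeric IDs, the
alias list, and the names build_alias_index and Document are assumptions
made up for illustration; only the category names and the McDonalds example
come from the description above.

```python
from dataclasses import dataclass, field

# Hypothetical category IDs; the names come from the example list above.
CATEGORIES = {
    1: "Hotel",
    2: "Restaurant",
    3: "Fast Food",
    4: "Chinese Restaurant",
}

# Hypothetical alias table: each alias shares the ID of its category.
ALIASES = {
    4: ["Chinese food"],
}


def build_alias_index(categories, aliases):
    """Map every lowercased category name or alias to its category ID."""
    index = {}
    for cat_id, name in categories.items():
        index[name.lower()] = cat_id
    for cat_id, names in aliases.items():
        for alias in names:
            index[alias.lower()] = cat_id
    return index


@dataclass
class Document:
    name: str
    primary: int                                    # exactly one primary category ID
    secondary: list = field(default_factory=list)   # secondary category IDs


ALIAS_INDEX = build_alias_index(CATEGORIES, ALIASES)

# McDonalds: Fast Food as primary, Restaurant as secondary.
mcdonalds = Document("McDonalds", primary=3, secondary=[2])
```

With this layout, a lookup such as ALIAS_INDEX["chinese food"] resolves to
the same ID as ALIAS_INDEX["chinese restaurant"].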


Given all this, I would like to know how, with the help of Lucene (or any
other efficient method), I could figure out the most relevant category (if
any) behind a user query.


For example:

If my user searches for
"Chang's chinese restaurant", the obvious category should be "Chinese
Restaurant",
but if my user searches for
"chines restauran" (misspelled), the category should also be "Chinese
Restaurant" (as Google is capable of doing),
or
"chinese bistro" should probably also return the category "Chinese
Restaurant", since bistro is a very similar concept to "Restaurant".
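
The three example queries above can be reproduced with a rough sketch like
the following. It is plain Python using difflib from the standard library,
not the Lucene API, and it is only one possible approach. The category IDs,
the SYNONYMS table, the 0.8 similarity cutoff, and the rule "prefer the
category name with the most words" are all assumptions chosen for
illustration.

```python
import difflib
import re

# Hypothetical (name, category ID) pairs; aliases share their category's ID.
NAMES = [
    ("restaurant", 2),
    ("fast food", 3),
    ("chinese restaurant", 4),
    ("chinese food", 4),
]

# Hand-maintained synonym table (assumption): related word -> category word.
SYNONYMS = {"bistro": "restaurant"}

# Every word that appears in some category name or alias.
VOCAB = sorted({word for name, _ in NAMES for word in name.split()})


def normalize(token):
    """Apply synonyms, then snap near-misses onto a known category word."""
    token = SYNONYMS.get(token, token)
    close = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.8)
    return close[0] if close else token


def identify_category(query):
    """Return the ID of the most specific category implied by the query, or None."""
    tokens = {normalize(t) for t in re.findall(r"[a-z']+", query.lower())}
    best = None  # (number of words in the matched name, category ID)
    for name, cat_id in NAMES:
        words = name.split()
        # A category matches only if all of its words occur in the query.
        if all(w in tokens for w in words):
            if best is None or len(words) > best[0]:
                best = (len(words), cat_id)
    return best[1] if best else None


print(identify_category("Chang's chinese restaurant"))  # 4
print(identify_category("chines restauran"))            # 4
print(identify_category("chinese bistro"))              # 4
```

Preferring the longest matching name is what makes "Chinese Restaurant" win
over the plain "Restaurant" category for these queries; a query with no
recognizable category word returns None.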


Once the category is identified, I can then query the index for the
documents that best match that category.
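
As a rough illustration of that second step, the sketch below ranks
documents so that a primary-category match outranks a secondary-category
match. It is plain Python rather than a Lucene query; the weights 2.0 and
1.0, the function name rank_by_category, and the "Some Bank" entry are
assumptions for illustration only.

```python
# Hypothetical documents: (name, primary category, secondary categories).
DOCS = [
    ("McDonalds", "Fast Food", ["Restaurant"]),
    ("The London Pub", "Pub",
     ["Bar", "Food and Beverages", "Restaurant", "Fast Food"]),
    ("Some Bank", "Bank", []),
]

# Assumed weights: a primary match counts twice as much as a secondary one.
PRIMARY_WEIGHT = 2.0
SECONDARY_WEIGHT = 1.0


def rank_by_category(docs, category):
    """Return names of documents in the category, best match first."""
    scored = []
    for name, primary, secondary in docs:
        if primary == category:
            score = PRIMARY_WEIGHT
        elif category in secondary:
            score = SECONDARY_WEIGHT
        else:
            continue  # document is not in this category at all
        scored.append((score, name))
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


print(rank_by_category(DOCS, "Fast Food"))
# ['McDonalds', 'The London Pub']
```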


What is the proper way to identify the most relevant category in a user
query, given the above?
Is there a better approach I should consider?


Any help appreciated.


Many thanks

Alex.
