lucene-java-user mailing list archives

From fei liu <>
Subject Re: Document category identification in query
Date Wed, 16 Dec 2009 02:36:44 GMT
Query classification is an interesting problem, and many papers have
discussed it. For more information, you could refer to these papers: "A
taxonomy of web search", "Understanding user goals in web search", and "Our
winning solution to query classification in KDDCUP 2005".

For your question, I think you can do this in two steps. First, use the query
to retrieve documents, then use the category information of the retrieved
documents to classify the query with an algorithm such as KNN. Second, use the
query together with its predicted category to retrieve documents again.
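
As a rough illustration of the first step, here is a minimal sketch in Python
of a score-weighted nearest-neighbour vote over the categories of the top
hits. It does not touch Lucene at all: the function name classify_query, the
parameter k, and the (score, category) hit format are assumptions made for
this example, and the scores are invented. In practice the hits would come
from your own search over the index.

```python
from collections import defaultdict

def classify_query(hits, k=10):
    """Guess a query's category from its top-k retrieved documents.

    hits: list of (score, category) pairs, sorted by descending score.
    This format is an assumption for this sketch, not a Lucene API.
    Returns the category with the largest summed score, or None.
    """
    votes = defaultdict(float)
    for score, category in hits[:k]:
        # Each neighbour votes for its category, weighted by its score.
        votes[category] += score
    if not votes:
        return None
    return max(votes, key=votes.get)

# Hypothetical hits for the query "chinese bistro" (made-up scores).
example_hits = [
    (2.0, "Chinese Restaurant"),
    (1.5, "Chinese Restaurant"),
    (1.2, "Fast Food"),
    (0.4, "Bank"),
]
print(classify_query(example_hits, k=3))
```

For the second step, the predicted category could then be added to the
original query, for instance as an extra and possibly boosted clause on the
category field, so that documents in that category rank higher.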

2009/12/15 Alex <>

> Hi,
> I am trying to expand user queries to figure out potential document
> categories implied in the query.
> I wanted to know what was the best way to figure out the document category
> that is the most relevant to the query.
> Let me explain further:
> I have created categories that are applied to documents I want to index.
> Some example categories are :
> Hotel
> Restaurant
> Fast Food
> Chinese Restaurant
> Church
> Bank
> Gas station
> I am also trying to create category aliases, so that Chinese food can also
> be named Chinese restaurant with the same unique category ID.
> The documents I index have 1 primary category and 1...N  secondary
> categories such as :
> McDonalds will be categorized under Fast food as a primary category but
> also under Restaurant as secondary category.
> The London Pub at the corner of my street will be categorized as Pub as
> primary category and also as Bar, Food and Beverages, Restaurant, and Fast
> Food (since they also have takeaway burgers ;).
> This all gives me a set of categories that are quite clearly identified, as
> well as a set of category aliases even though I'm aware that I can't figure
> out all the possible aliases of all my categories. At least I have the most
> obvious ones.
> Now with all this, I wanted to know, with the help of Lucene (or any other
> efficient method),  how I could figure out the most relevant category (if
> any) behind a user query.
> For example :
> If my user looks for :
> "Chang's chinese restaurant" the obvious category should be "Chinese
> Restaurant"
> but if my user looks for
> "chines restauran"  (misspelled) the category should also be "Chinese
> Restaurant" (as Google is capable of doing)
> OR
> "chinese bistro" should probably also return me the category "Chinese
> Restaurant" since bistro is a very similar concept to "Restaurant" ...
> Once the category is identified I can then query the index for documents
> that match that category the best.
> What is the proper way to identify the most relevant category in a user
> query based on the above ?
> Should I consider any other better approach ?
> Any help appreciated.
> Many thanks
> Alex.

Liu Fei(刘飞)
Institute of Software
School of Electronics Engineering & Computer Science
Peking University
Beijing, 100871, PRC.
