lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From fei liu <liufeipeki...@gmail.com>
Subject Re: Document category identification in query
Date Mon, 21 Dec 2009 13:22:40 GMT
that's a good idea. This approad will work.

Supposing someone submit a query "nice rice restaurant", I think he/she
wants to find a chinese restaurant. But you will find the category of
"restaurant" retrieved with higher score than "chinese restaurant".
Usually you may find no category retrieved by your approach, because users'
query doesn't share the same words with your categories' name.
But you could have a try. If you have interesting results, let me know.
Thanks!

That's my opinion.

2009/12/21 Alex <azlist1@gmail.com>

> Hi !
>
> Many thanks to both of you for your suggestions and answers!
>
> What Weiwei Wang suggests is a part of the solution I am willing to
> implement. I will definitely use the suggest-as-you-type approach in the
> query form as it will allow for pre-emptive disambiguation and I believe,
> will give very satisfying results.
>
> However, search users are wild beasts and I can't count on them to always
> use the given suggestions. All I can count on is very erratic, sparse and
> ambiguous queries :) So I need an almost fool proof solution.
>
> To answer your question :
> "BTW, I do not understand why you need to know the category of user input"
> I am trying to understand the user intent behind the query to filter out
> results based on a given category of locations. If a user queries "Fast
> Food
> in Nanjing" I don't want to return all the documents that contain the words
> "Fast" and "Food" and "Nanjing". I use a custom algorithm to figure out the
> intended location first. Then using the Spatial contrib I filter out the
> results based on a given area that was identified earlier. Finally I sort
> the results according to distance from the location point / centroid found
> earlier.
>
>  Identifying the category allows me two things :
>
> 1) Filter out irrelevant results : I dont want my resultset to include a
> Supermarket in "Nanjing" where the "food" is fresh and service is "fast"
> just because the query words were included in the description of the
> location. Since I am using custom, distance based, sorting of the results,
> I
> can't afford to have the supermarket be the top result because it is the
> closest to the location centroid identified earlier. The user intent was
> clearly "fast food" and not a supermarket !
>
> 2) Understand user intent to provide targetted advertizing.
>
> 3) Understanding the category of location a user is looking for also allows
> to calculate more accurately the bounding box  = the maximum distance at
> which the location should be located to be relevant to the user. A user
> looking for Pizza in New York is expecting to have his results within a
> radius of a maximum of 1 or 2 miles. If he is looking for a Theme Park he
> will probably be willing to go further away to find it. So identifying the
> category of the location the user is looking for lets me calculate the
> didstance radius more acurately.
>
>
>
>
> Fei Liu
>
> Thanks a lot for the papers you pointed me to. I cam accross them earlier
> in
> my research and re-reading them gave me new insights. However I believe
> that
> the Two steps approach you are recomending is not very viable under heavy
> load as it requires two passes on the index.
> I believe however that Identifying the dominant category(ies) of the
> resultset when no category could be clearly identified using the query
> alone, can be very valuable if sent back to the user as an information and
> a
> category link !
>
> Now what I think I will do to pre-emptively identify the location
> category(ies) implied in the query :
>
> 1 - use my own custom category set and index their names using the synonym
> analyzer provided with Lucene and also use some sort of normalization such
> as stemmin. maybe also using snowball analyzer.
> 2 - break the query into Shingles (word based grams) and analyze each
> shingle using the analyzers that were used in (1). then query Lucene with
> these analyzed shingles against the category index built earlier.
>
> Hopefully the category with the highest Lucene score should be the one
> intended by the user....
>
> Later on, I also intend to use some sort of training based approach using
> search queries that would have been tagged with the relevant location
> categories.
>
> What do you guys think ?
>
> Would this be a viable approach ?
>
> Thanks for all !
>
>
> Cheers
>
> Alex
>



-- 
-------------------------
Liu Fei(刘飞)
Institute of Software
School of Electronics Engineering & Computer Science
Peking University
Beijing, 100871, PRC.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message