lucene-java-user mailing list archives

From Alex <azli...@gmail.com>
Subject Re: Document category identification in query
Date Sun, 20 Dec 2009 18:50:07 GMT
Hi !

Many thanks to both of you for your suggestions and answers!

What Weiwei Wang suggests is part of the solution I intend to implement.
I will definitely use the suggest-as-you-type approach in the query form,
as it allows for pre-emptive disambiguation and, I believe, will give very
satisfying results.

However, search users are wild beasts and I can't count on them to always
use the given suggestions. All I can count on is very erratic, sparse and
ambiguous queries :) So I need an almost foolproof solution.

To answer your question :
"BTW, I do not understand why you need to know the category of user input"
I am trying to understand the user intent behind the query so that I can
filter the results by category of location. If a user queries "Fast Food
in Nanjing" I don't want to return all the documents that contain the words
"Fast" and "Food" and "Nanjing". I use a custom algorithm to figure out the
intended location first. Then, using the Spatial contrib, I restrict the
results to the area that was identified. Finally I sort the results by
distance from the location point / centroid found in the first step.
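
A minimal sketch of the filter-then-sort step described above, in plain
Python rather than with the Spatial contrib. The function names, the
document layout (dicts with "lat" and "lon" keys) and the use of the
haversine formula are all assumptions made for illustration; they are not
taken from the contrib API.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def filter_and_sort(docs, centroid, radius_km):
    """Keep docs within radius_km of the centroid, nearest first.

    docs is a list of dicts with hypothetical "lat" and "lon" keys.
    """
    lat0, lon0 = centroid
    scored = [(haversine_km(lat0, lon0, d["lat"], d["lon"]), d) for d in docs]
    within = [(dist, d) for dist, d in scored if dist <= radius_km]
    within.sort(key=lambda pair: pair[0])
    return [d for _, d in within]
```

Note that a pure distance sort like this is exactly why the category
filter matters: whatever survives the filter and sits closest to the
centroid ends up on top.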

Identifying the category lets me do three things:

1) Filter out irrelevant results: I don't want my result set to include a
supermarket in "Nanjing" where the "food" is fresh and the service is "fast"
just because the query words appear in the description of the location.
Since I am using custom, distance-based sorting of the results, I can't
afford to have the supermarket be the top result because it is the closest
to the location centroid identified earlier. The user intent was clearly
"fast food" and not a supermarket!

2) Understand user intent to provide targeted advertising.

3) Understanding the category of location a user is looking for also lets
me calculate the bounding box more accurately, i.e. the maximum distance
at which a location is still relevant to the user. A user looking for
Pizza in New York expects results within a radius of 1 or 2 miles at most.
If he is looking for a Theme Park he will probably be willing to go further
away to find it. So identifying the category of the location the user is
looking for lets me calculate the distance radius more accurately.
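
As a rough sketch of point 3, a per-category radius table can drive the
bounding box. Only the 2-mile figure for pizza comes from the example
above; the theme park value, the default radius and all names here are
placeholders chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0
MILES_TO_KM = 1.609344

# Hypothetical radii in miles; only "pizza" reflects the example in the text.
CATEGORY_RADIUS_MILES = {
    "pizza": 2.0,
    "theme park": 50.0,  # placeholder value, not from the original message
}
DEFAULT_RADIUS_MILES = 5.0  # placeholder used when no category was identified


def radius_km_for(category):
    """Return the search radius in kilometres for a category name."""
    key = (category or "").strip().lower()
    return CATEGORY_RADIUS_MILES.get(key, DEFAULT_RADIUS_MILES) * MILES_TO_KM


def bounding_box(lat, lon, radius_km):
    """Approximate (min_lat, min_lon, max_lat, max_lon) around a point.

    Simple spherical approximation; it does not handle the poles or the
    antimeridian.
    """
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = math.degrees(
        radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)
```

For example, `bounding_box(40.71, -74.0, radius_km_for("pizza"))` gives a
small box around a New York centroid, while an unknown category falls back
to the default radius.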




Fei Liu

Thanks a lot for the papers you pointed me to. I came across them earlier in
my research and re-reading them gave me new insights. However, I believe that
the two-step approach you are recommending is not very viable under heavy
load, as it requires two passes on the index.
I do believe, however, that identifying the dominant category(ies) of the
result set when no category could be clearly identified from the query
alone can be very valuable if sent back to the user as information and a
category link!
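
One simple way to picture the "dominant category of the result set" idea
is a frequency count over the returned documents. This is only a sketch:
the "category" field, the `min_share` threshold and the function name are
assumptions, not something the papers or Lucene prescribe.

```python
from collections import Counter


def dominant_categories(results, top_n=1, min_share=0.5):
    """Return up to top_n (category, share) pairs that cover at least
    min_share of the categorized results.

    results is a list of dicts with a hypothetical "category" key.
    """
    counts = Counter(doc["category"] for doc in results if doc.get("category"))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(category, n / total)
            for category, n in counts.most_common(top_n)
            if n / total >= min_share]
```

An empty list would mean no category is dominant enough to offer as a
link.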

Now what I think I will do to pre-emptively identify the location
category(ies) implied in the query :

1 - Use my own custom category set and index the category names using the
synonym analyzer provided with Lucene, plus some sort of normalization such
as stemming, maybe using the Snowball analyzer.
2 - Break the query into shingles (word-based n-grams) and analyze each
shingle using the same analyzers that were used in (1). Then query Lucene
with these analyzed shingles against the category index built earlier.
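
The two steps above can be pictured with a small in-memory sketch. It
deliberately does not use Lucene: the dictionary lookup stands in for the
category index, the crude plural stripping stands in for real stemming,
and the squared-length weight stands in for the Lucene score. The category
names and synonyms are made up for the example.

```python
def normalize(token):
    """Lowercase and strip a trailing 's' (a crude stand-in for stemming)."""
    token = token.lower()
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def shingles(tokens, max_size=3):
    """Return all word n-grams of length 1..max_size, shortest first."""
    out = []
    for size in range(1, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out


# Hypothetical category set with synonyms (step 1).
CATEGORY_SYNONYMS = {
    "fast_food": ["fast food", "burgers", "takeaway"],
    "supermarket": ["supermarket", "grocery store"],
}


def build_index(category_synonyms):
    """Map each normalized category name or synonym to its categories."""
    index = {}
    for category, names in category_synonyms.items():
        for name in names:
            key = " ".join(normalize(t) for t in name.split())
            index.setdefault(key, set()).add(category)
    return index


def rank_categories(query, index, max_size=3):
    """Score categories by matching query shingles (step 2).

    Longer matching shingles weigh more; best category first.
    """
    tokens = [normalize(t) for t in query.split()]
    scores = {}
    for shingle in shingles(tokens, max_size):
        for category in index.get(shingle, ()):
            weight = len(shingle.split()) ** 2
            scores[category] = scores.get(category, 0) + weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With this toy index, `rank_categories("Fast Food in Nanjing",
build_index(CATEGORY_SYNONYMS))` matches only the two-word shingle
"fast food", so the supermarket category does not appear at all.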

Hopefully the category with the highest Lucene score will be the one
intended by the user...

Later on, I also intend to use some sort of training-based approach using
search queries that have been tagged with the relevant location
categories.

What do you guys think ?

Would this be a viable approach ?

Thanks for all !


Cheers

Alex
