lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject asktog on search problems
Date Fri, 21 May 2004 16:09:37 GMT
Haven't seen this discussed here.
See 7a at the link below:

http://www.asktog.com/columns/062top10ReasonsToNotShop.html

7a talks about searching on a camera site for the "Lowepro 100 AW".

He says this query works:        "Lowepro 100 AW"
and this query does not work: "Lowepro 100AW"

Cross-checking with Google shows that the 1st form is indeed much more 
popular. However, the 2nd form is also used, and if you're a commerce site 
or a site that wants to make it easier for users to find things, you should 
help them out.

So the discussion question is: what's the best way to handle this?

I guess the somewhat general form of this is that, in a query, any term 
might need to be split into 2 terms that are individually indexed (so 
"100AW" is not indexed, but "100" and "AW" are).
In a way the flip side of this is that any 2 terms could be concatenated 
to form another term that was indexed (so in another universe it might 
be that passing "100 AW" is not as precise as passing "100AW" but how's 
the user to know).
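
To make the two directions concrete, here is a minimal sketch in plain 
Python rather than against the Lucene API. The function names 
split_alnum and join_adjacent are made up for illustration, and 
splitting at letter/digit boundaries is only one assumed heuristic for 
deciding where a term like "100AW" should break.

```python
import re

# Runs of digits, runs of ASCII letters, or runs of anything else.
_RUNS = re.compile(r"\d+|[A-Za-z]+|[^\dA-Za-z]+")


def split_alnum(term):
    """Split a term at letter/digit boundaries, e.g. "100AW" -> ["100", "AW"]."""
    return _RUNS.findall(term)


def join_adjacent(terms):
    """Concatenate every adjacent pair of terms (the flip side of splitting)."""
    return [a + b for a, b in zip(terms, terms[1:])]


if __name__ == "__main__":
    print(split_alnum("100AW"))
    print(join_adjacent("Lowepro 100 AW".split()))
```

Either function can be used to generate alternate forms of a query, 
which would then still have to be checked against what is actually in 
the index.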

In the context of Lucene ways to handle this seem to be:
- automagically run a fuzzy query (so if a query doesn't work, transform 
"Lowepro 100AW" to "Lowepro~ 100AW~")
- write a query parser that breaks apart unindexed tokens into ones that 
are indexed (so "100AW" becomes "100 AW")
- write a tokenizer that inserts dummy tokens for every pair of tokens, 
so the stream "Lowepro 100 AW" would also have "Lowepro100" and "100AW" 
inserted, presumably via magic w/ TokenStream.next()
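
The third option can be sketched roughly as follows. This is plain 
Python, not a real Lucene TokenStream; the name with_pair_tokens is 
hypothetical, and emitting each concatenated pair with a position 
increment of 0 (so it shares a position with the first token of the 
pair, the way injected synonyms are often handled) is an assumption 
about how one might want it indexed, not something tested here.

```python
def with_pair_tokens(tokens):
    """Yield (term, position_increment) for each token plus injected pairs.

    Every original token advances the position by 1. The concatenation of a
    token with its successor is emitted right after it with increment 0, so
    it occupies the same position as the first token of the pair.
    """
    tokens = list(tokens)
    for i, tok in enumerate(tokens):
        yield tok, 1
        if i + 1 < len(tokens):
            # Injected "dummy" token, e.g. "100" + "AW" -> "100AW".
            yield tok + tokens[i + 1], 0


if __name__ == "__main__":
    for term, inc in with_pair_tokens(["Lowepro", "100", "AW"]):
        print(term, inc)
```

The cost of this approach is index size: a stream of n tokens gains 
n - 1 extra terms, most of which nobody will ever search for.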

Comments on best way to handle this?






