lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: asktog on search problems
Date Fri, 21 May 2004 23:49:23 GMT
This is not specific advice, but an idea that I think Google leverages 
to build up search corrections.  If a user searches for "100AW" and it 
doesn't match, but a moment later they try something different and 
immediately get to a product page, the system can make a loose 
connection between their original search and the product they soon 
thereafter found.  Over time, the connections get stronger because 
others will do the same thing.

I think term vectors could factor into making latent connections 
somehow also.

Just postulating...


On May 21, 2004, at 12:09 PM, David Spencer wrote:

> Haven't seen this discussed here.
> See 7a at the link below:
> 7a talks about searching on a camera site for the "Lowepro 100 AW".
> He says this query works:        "Lowepro 100 AW"
> and this query does not work: "Lowepro 100AW"
> Cross checking with google indeed shows that the 1st form is much more 
> popular, however the 2nd form is used, and if you're a commerce site 
> or a site that wants to make it easier for users to find things you 
> should help them out.
> So the discussion question is what's the best way to handle this.
> I guess the somewhat general form of this is that in a query, and term 
> might be split into 2 terms that are individually indexed (so "100AW" 
> is not indexed, but "100" and "AW" is).
> In a way the flip side of this is that any 2 terms could be 
> concatenated to form another term that was indexed (so in another 
> universe it might be that passing "100 AW" is not as precise as 
> passing "100AW" but how's the user to know).
> In the context of Lucene ways to handle this seem to be:
> - automagically run a fuzzy query (so if a query doesn't work, 
> transform "Lowepro 100AW" to "Lowepro~ 100AW~"
> - write a query parser that breaks apart unindexed tokens into ones 
> that are indexed (so "100AW" becomes "100 AW")
> - write a tokenizer that inserts dummy tokens for every pair of 
> tokens, so the stream "Lowepro 100 AW" would also have "Lowepro100" 
> and "100AW" inserted, presumably via magic w/
> Comments on best way to handle this?
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message