lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Reordering search results
Date Mon, 03 Oct 2005 10:02:14 GMT

On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
>> 1- Words in Document that are more close to original search terms have
>> a larger Score. For example, if I was searching for "wellcome",
>> Document("wellcome") must be better than Document("welcome")
>>
>
> I'm just "thinking outloud" here, but some ideas that come to mind
> are:  Index both the original text (with spelling errors), and the
> spelling-corrected text.  When you search, search on both the
> corrected text, and in a non-required query clause search on the
> uncorrected text, maybe boosted down a bit.  This way, if the spelling
> was correct, it will match both the original term and the corrected
> term (since they're the same), but a document with a misspelling would
> match only the corrected term.  You'll have to experiment with boosts
> and relevance/rankings here.
>
> Another idea is, if you know the number of misspellings made at
> indexing time (it seems like you do), then boost documents based on
> the number of spelling errors -- higher boost factor for fewer errors.

Another tip: the score is based on term frequency, so when tokenizing
correctly spelled words, emit each of them multiple times to weight
the score towards them.
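
To make the term-frequency idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. `RepeatCorrectTokens`, `expand`, and the `corrections` map are hypothetical names for illustration only; it assumes a spell checker at indexing time supplies the misspelling-to-correction map. In a real analyzer the repeated tokens would probably need a position increment of 0 so that phrase positions are not shifted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: repeats correctly spelled tokens
// so they contribute a higher term frequency than corrected misspellings.
class RepeatCorrectTokens {

    // corrections maps a misspelling to its corrected form (assumed to come
    // from whatever spell checker runs at indexing time).
    static List<String> expand(List<String> tokens, Map<String, String> corrections, int repeat) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String corrected = corrections.get(token);
            if (corrected != null) {
                out.add(corrected);          // misspelled: emit the corrected form once
            } else {
                for (int i = 0; i < repeat; i++) {
                    out.add(token);          // spelled correctly: emit it `repeat` times
                }
            }
        }
        return out;
    }

    // Counts how often a term occurs in the expanded token list.
    static int termFrequency(List<String> tokens, String term) {
        int count = 0;
        for (String t : tokens) {
            if (t.equals(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> corrections = Map.of("wellcome", "welcome");
        List<String> good = expand(List.of("welcome", "there"), corrections, 3);
        List<String> bad = expand(List.of("wellcome", "there"), corrections, 3);
        System.out.println(termFrequency(good, "welcome")); // 3
        System.out.println(termFrequency(bad, "welcome"));  // 1
    }
}
```

The document that was spelled correctly ends up with a term frequency of 3 for "welcome", the corrected one with 1, so the former should score higher for that term.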

>> 2- Documents that have search terms close to each other, have a larger
>> Score. For example, if I was searching for "welcome there",
>> Document("welcome there") must be better than Document("welcome all
>> there"). Note that "all" is a stop word in my implementation.
>>
>
> PhraseQuery with a high slop factor (MAX_INT works) scores higher for
> terms that are closer together.  You can construct the PhraseQuery
> yourself (programmatically), or QueryParser takes it as:
>
> "welcome there"~99999
>
> (with the quotes)  99999 is the slop factor, which means to accept
> documents where "welcome" is within 99999 positions from "there".

The issue is that "all" is a stop word, though.  The StopFilter does  
not leave a hole when stop words are removed, so indexing "welcome  
all there" is exactly the same as indexing "welcome there" as far as  
the index is concerned.  I started to address this situation in the
1.4.x Lucene releases, but it introduced a backward-incompatible change,
so we reverted it.  Care must be taken on the Query side of things -
PhraseQuery did not deal with anything but term position increments  
of 1, but this has been addressed in the latest codebase (in  
Subversion).
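
A small model of the difference, in plain Java rather than Lucene code: `StopWordPositions` and its `preserveHoles` flag are made-up names for illustration, and the position arithmetic is only a sketch of the idea of position increments, not the actual StopFilter implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative model only; these are not Lucene classes.
class StopWordPositions {

    // Returns the position each surviving token would be indexed at.
    // preserveHoles=false models stop words vanishing without a gap;
    // preserveHoles=true models a filter that keeps the gap as a larger
    // position increment on the next token.
    static List<Integer> positions(List<String> tokens, Set<String> stopWords, boolean preserveHoles) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        int skipped = 0;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++;                   // remember the removed stop word
                continue;
            }
            int increment = preserveHoles ? 1 + skipped : 1;
            position += increment;
            skipped = 0;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("all");
        List<String> doc = List.of("welcome", "all", "there");
        System.out.println(positions(doc, stop, false)); // [0, 1]
        System.out.println(positions(doc, stop, true));  // [0, 2]
    }
}
```

Without holes, "welcome all there" yields positions [0, 1], identical to "welcome there", so no slop-based scoring can tell them apart. With holes it yields [0, 2], and the two documents become distinguishable.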

I built a PositionalStopFilter and discussed these details in the
Analysis chapter of "Lucene in Action". It is available in the
code .zip at http://www.lucenebook.com

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

