lucene-dev mailing list archives

From Joaquin Delgado <joaquin.delg...@oracle.com>
Subject Re: Reordering search results
Date Mon, 03 Oct 2005 18:05:32 GMT
Chris, you may want to consider a modified version of the Nutch analysis
package
(http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/package-summary.html),
which has a very slick treatment of stopwords. See chapter 4, page 145
of "Lucene in Action" by Erik and Otis for some details about the Nutch
implementation.

-- J.D.

Erik Hatcher wrote:

>
> On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
>
>>> 1- Words in a Document that are closer to the original search terms
>>> should have a larger Score. For example, if I was searching for
>>> "wellcome", Document("wellcome") must be better than Document("welcome")
>>>
>>
>> I'm just "thinking out loud" here, but some ideas that come to mind
>> are:  Index both the original text (with spelling errors) and the
>> spelling-corrected text.  When you search, search on the corrected
>> text and, in a non-required query clause, on the uncorrected text,
>> maybe boosted down a bit.  This way, if the spelling was correct, it
>> will match both the original term and the corrected term (since
>> they're the same), but a document with a misspelling would match only
>> the corrected term.  You'll have to experiment with boosts and
>> relevance/rankings here.
>>
>> Another idea is, if you know the number of misspellings made at
>> indexing time (it seems like you do), then boost documents based on
>> the number of spelling errors -- higher boost factor for fewer errors.
>
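To make Chris's two ideas a bit more concrete, here is a minimal sketch in plain Java (no Lucene dependency). The field names "corrected" and "original", the class name, and the 1/(1+errors) boost formula are all assumptions of mine for illustration. The query string uses QueryParser's "+" (required clause) and "^" (boost) syntax, and terms are not escaped here.

```java
import java.util.Locale;

// Hypothetical helper; the field names "corrected" and "original" are assumptions.
class SpellingAwareQuery {

    // Required clause on the spelling-corrected field, plus an optional,
    // down-boosted clause on the field holding the text as originally written.
    static String build(String typedTerm, String correctedTerm, float originalBoost) {
        return String.format(Locale.ROOT, "+corrected:%s original:%s^%.1f",
                correctedTerm, typedTerm, originalBoost);
    }

    // One possible index-time document boost: 1.0 for no misspellings,
    // shrinking as the error count grows. The formula is an assumption.
    static float documentBoost(int spellingErrors) {
        return 1.0f / (1.0f + spellingErrors);
    }

    public static void main(String[] args) {
        System.out.println(build("wellcome", "welcome", 0.5f));
        System.out.println(documentBoost(2));
    }
}
```

The resulting string could be handed to QueryParser, or the same two clauses could be assembled programmatically as a BooleanQuery. The boost values would need the experimentation Chris mentions.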
>
> Another tip is that score is based on term frequency - so when
> tokenizing correct spellings, add multiple copies of the correct
> words to weight towards them.
>
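A rough sketch of Erik's term-frequency tip, again in plain Java rather than as a real TokenFilter. The class name, the dictionary lookup, and the copy count are hypothetical. It simply repeats every token found in a set of known-good spellings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not a Lucene TokenFilter.
class RepeatCorrectTokens {

    // Emits each correctly spelled token `copies` times and every other
    // token once, which raises the term frequency of the correct spellings.
    static List<String> expand(List<String> tokens, Set<String> dictionary, int copies) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            int n = dictionary.contains(token) ? copies : 1;
            for (int i = 0; i < n; i++) {
                out.add(token);
            }
        }
        return out;
    }
}
```

In a real analyzer the extra copies would probably be emitted with a position increment of 0 so that phrase matching is not disturbed. As far as I know the default similarity also dampens term frequency (a square root), so three copies would not triple the score.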
>>> 2- Documents that have search terms close to each other should have
>>> a larger Score. For example, if I was searching for "welcome there",
>>> Document("welcome there") must be better than Document("welcome all
>>> there"). Note that "all" is a stop word in my implementation.
>>>
>>
>> PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores
>> higher for terms that are closer together.  You can construct the
>> PhraseQuery yourself (programmatically), or QueryParser takes it as:
>>
>> "welcome there"~99999
>>
>> (with the quotes).  99999 is the slop factor, which means to accept
>> documents where "welcome" is within 99999 positions of "there".
>
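To illustrate why a sloppy phrase favours closer terms, here is a simplified plain-Java sketch. It only handles two terms appearing in order, and the 1/(gap+1) factor is my assumption about how the default similarity weights a sloppy match, so treat the numbers as illustrative. The class and method names are hypothetical.

```java
// Hypothetical illustration of proximity weighting; not Lucene code.
class SloppyProximity {

    // Smallest number of tokens sitting between an occurrence of `first`
    // and a later occurrence of `second` (0 = adjacent), or -1 if no
    // in-order pair exists.
    static int minGap(String[] tokens, String first, String second) {
        int best = -1;
        int lastFirst = -1;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(first)) {
                lastFirst = i;
            } else if (tokens[i].equals(second) && lastFirst >= 0) {
                int gap = i - lastFirst - 1;
                if (best < 0 || gap < best) {
                    best = gap;
                }
            }
        }
        return best;
    }

    // Assumed weighting: an exact phrase gets 1.0, wider gaps get less.
    static float proximityFactor(int gap) {
        return 1.0f / (gap + 1);
    }
}
```

With this weighting, "welcome there" (gap 0) gets a factor of 1.0, while "welcome dear friends there" (gap 2) gets about 0.33. A large slop therefore still matches distant terms but ranks them lower.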
>
> The issue is that "all" is a stop word, though.  The StopFilter does  
> not leave a hole when stop words are removed, so indexing "welcome  
> all there" is exactly the same as indexing "welcome there" as far as  
> the index is concerned.  I started to address this situation in the  
> 1.4.x Lucene releases but it introduced a backward incompatible issue  
> so we reverted.  Care must be taken on the Query side of things -  
> PhraseQuery did not deal with anything but term position increments  
> of 1, but this has been addressed in the latest codebase (in  
> Subversion).
>
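The "no hole" problem Erik describes can be simulated in a few lines of plain Java. This is only a sketch: the class name and the keepHoles flag are hypothetical, and it is not the PositionalStopFilter from the book. It records each surviving token with its position, either closing the gap left by a stop word or keeping it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical simulation of token positions; not a Lucene filter.
class StopWordPositions {

    // Returns "token@position" entries for the non-stop words in `text`.
    // With keepHoles == false a removed stop word does not advance the
    // position counter; with keepHoles == true it leaves a gap.
    static List<String> index(String text, Set<String> stopWords, boolean keepHoles) {
        List<String> postings = new ArrayList<>();
        int position = 0;
        for (String token : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (stopWords.contains(token)) {
                if (keepHoles) {
                    position++;
                }
                continue;
            }
            postings.add(token + "@" + position);
            position++;
        }
        return postings;
    }
}
```

Without holes, "welcome all there" and "welcome there" both come out as welcome@0, there@1, so no phrase or slop setting can tell them apart. With holes, the first becomes welcome@0, there@2, and proximity scoring has something to work with.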
> I built a PositionalStopFilter for the Analysis chapter of "Lucene in
> Action" and discussed these details there - it is available in the
> code .zip at http://www.lucenebook.com
>
>     Erik
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>



