lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Reordering search results
Date Thu, 06 Oct 2005 12:40:31 GMT

On Oct 6, 2005, at 8:28 AM, Ahmed El-dawy wrote:
> Thanks for your help.
> I used PhraseQuery to boost close terms. I have an idea for stop
> words, but I don't know if it has any drawbacks. I can index a dummy
> Token in place of every stop word. This token will never be searched,
> but it will be counted as a Token and will leave a gap between words.
> Does this solution have any drawbacks?

There is no need to index a dummy token to make a space.  You can
simply set the position increment on the 2nd token to 2, which
means 2 positions past the last one.  The default is 1, meaning
successive positions.
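
To illustrate the idea, here is a minimal sketch in plain Java. It
deliberately does not use any Lucene classes: Tok, analyze, positionOf
and the preserveHoles flag are hypothetical names made up for this
example, and the position arithmetic is only a model of how increments
add up (first token at position 0, each later token at the previous
position plus its increment).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PositionIncrementSketch {

    /** A kept token and its increment relative to the previously kept token. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Splits on whitespace and drops stop words. When preserveHoles is true,
     * the token after a removed stop word gets a larger increment instead of
     * a dummy token being indexed in the gap.
     */
    static List<Tok> analyze(String text, Set<String> stopWords, boolean preserveHoles) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (stopWords.contains(word)) {
                skipped++;
                continue;
            }
            out.add(new Tok(word, preserveHoles ? 1 + skipped : 1));
            skipped = 0;
        }
        return out;
    }

    /** Absolute position of the first occurrence of term, or -1 if absent. */
    static int positionOf(List<Tok> tokens, String term) {
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.positionIncrement;
            if (t.text.equals(term)) {
                return pos;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("all"));
        for (boolean holes : new boolean[] {false, true}) {
            List<Tok> toks = analyze("welcome all there", stops, holes);
            int gap = positionOf(toks, "there") - positionOf(toks, "welcome");
            // prints gap=1 without holes, gap=2 with holes
            System.out.println("preserveHoles=" + holes + " gap=" + gap);
        }
    }
}
```

With the hole preserved, "welcome" and "there" end up two positions
apart, so "welcome all there" is no longer indistinguishable from
"welcome there" by position.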

     Erik


>
>
> On 10/3/05, Joaquin Delgado <joaquin.delgado@oracle.com> wrote:
>
>> Chris, you may consider using a modified version of the Nutch analysis
>> (http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/package-summary.html),
>> which has a very slick treatment of stopwords. Please refer to chapter
>> 4, page 145 of "Lucene in Action", written by Erik and Otis, for some
>> details about the Nutch implementation.
>>
>> -- J.D.
>>
>> Erik Hatcher wrote:
>>
>>
>>>
>>> On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
>>>
>>>
>>>>> 1- Words in Document that are closer to the original search terms have
>>>>> a larger Score. For example, if I was searching for "wellcome",
>>>>> Document("wellcome") must be better than Document("welcome")
>>>>>
>>>>>
>>>>
>>>> I'm just "thinking out loud" here, but some ideas that come to mind
>>>> are:  Index both the original text (with spelling errors), and the
>>>> spelling-corrected text.  When you search, search on the corrected
>>>> text and, in a non-required query clause, on the uncorrected text,
>>>> maybe boosted down a bit.  This way, if the spelling was correct, it
>>>> will match both the original term and the corrected term (since
>>>> they're the same), but a document with a misspelling would match
>>>> only the corrected term.  You'll have to experiment with boosts and
>>>> relevance/rankings here.
>>>>
>>>> Another idea is, if you know the number of misspellings made at
>>>> indexing time (it seems like you do), then boost documents based on
>>>> the number of spelling errors -- higher boost factor for fewer errors.
>>>>
>>>
>>>
>>> Another tip is that score is based on term frequency - so when
>>> tokenizing correct spellings, add multiple copies of the correct
>>> words to weight towards them.
>>>
>>>
>>>>> 2- Documents that have search terms close to each other have a larger
>>>>> Score. For example, if I was searching for "welcome there",
>>>>> Document("welcome there") must be better than Document("welcome all
>>>>> there"). Note that "all" is a stop word in my implementation.
>>>>>
>>>>>
>>>>
>>>> PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores
>>>> higher for terms that are closer together.  You can construct the
>>>> PhraseQuery yourself (programmatically), or QueryParser takes it as:
>>>>
>>>> "welcome there"~99999
>>>>
>>>> (with the quotes).  99999 is the slop factor, which means to accept
>>>> documents where "welcome" is within 99999 positions of "there".
>>>>
>>>
>>>
>>> The issue is that "all" is a stop word, though.  The StopFilter does
>>> not leave a hole when stop words are removed, so indexing "welcome
>>> all there" is exactly the same as indexing "welcome there" as far as
>>> the index is concerned.  I started to address this situation in the
>>> 1.4.x Lucene releases but it introduced a backward-incompatible issue
>>> so we reverted.  Care must be taken on the Query side of things -
>>> PhraseQuery did not deal with anything but term position increments
>>> of 1, but this has been addressed in the latest codebase (in
>>> Subversion).
>>>
>>> I built a PositionalStopFilter for this and discussed these details
>>> in the Analysis chapter of "Lucene in Action" - it is available in
>>> the code .zip at http://www.lucenebook.com
>>>
>>>     Erik
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>>
>>
>>
>>
>>
>
>
> --
> regards,
> Ahmed Saad
>
>



