From: Erik Hatcher <erik@ehatchersolutions.com>
Subject: Re: Reordering search results
Date: Mon, 3 Oct 2005 06:02:14 -0400
To: java-dev@lucene.apache.org


On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
>> 1- Words in Document that are more close to original search terms
>> have a larger Score. For example, if I was searching for "wellcome",
>> Document("wellcome") must be better than Document("welcome")
>>
>
> I'm just "thinking out loud" here, but some ideas that come to mind
> are:  Index both the original text (with spelling errors), and the
> spelling-corrected text.  When you search, search on both the
> corrected text, and in a non-required query clause search on the
> uncorrected text, maybe boosted down a bit.  This way, if the spelling
> was correct, it will match both the original term and the corrected
> term (since they're the same), but a document with a misspelling would
> match only the corrected term.  You'll have to experiment with boosts
> and relevance/rankings here.
>
> Another idea is, if you know the number of misspellings made at
> indexing time (it seems like you do), then boost documents based on
> the number of spelling errors -- higher boost factor for fewer errors.

Another tip: the score is based in part on term frequency, so when
tokenizing correctly spelled words you can emit each one multiple
times to weight the score towards them.
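
One reading of that tip, as a minimal sketch in plain Java rather than
the Lucene analysis API: tokens found in a dictionary of correct
spellings are emitted several times, so their term frequency in the
indexed stream goes up.  The class name, method names, dictionary, and
repeat count are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SpellingWeightSketch {

    /**
     * Returns a new token list in which every token found in
     * `dictionary` appears `repeat` times and every other token
     * appears once. (Hypothetical helper, not a Lucene class.)
     */
    public static List<String> weightCorrectSpellings(
            List<String> tokens, Set<String> dictionary, int repeat) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            int copies = dictionary.contains(token) ? repeat : 1;
            for (int i = 0; i < copies; i++) {
                out.add(token);
            }
        }
        return out;
    }

    /** Counts how often `term` occurs in the token list. */
    public static int termFrequency(List<String> tokens, String term) {
        int count = 0;
        for (String token : tokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed dictionary containing only the correct spelling.
        Set<String> dictionary = new HashSet<>(Arrays.asList("welcome"));
        List<String> tokens = Arrays.asList("welcome", "wellcome");

        List<String> weighted = weightCorrectSpellings(tokens, dictionary, 3);
        System.out.println(weighted);
        System.out.println(termFrequency(weighted, "welcome"));
        System.out.println(termFrequency(weighted, "wellcome"));
    }
}
```

Note that repeating tokens this way also shifts the positions of the
terms that follow, which matters for phrase queries.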

>> 2- Documents that have search terms close to each other, have a
>> larger Score. For example, if I was searching for "welcome there",
>> Document("welcome there") must be better than Document("welcome all
>> there"). Note that "all" is a stop word in my implementation.
>>
>
> PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores
> higher for terms that are closer together.  You can construct the
> PhraseQuery yourself (programmatically), or QueryParser takes it as:
>
> "welcome there"~99999
>
> (with the quotes).  99999 is the slop factor, which means to accept
> documents where "welcome" is within 99999 positions of "there".

The issue is that "all" is a stop word, though.  The StopFilter does
not leave a hole when stop words are removed, so indexing "welcome
all there" is exactly the same as indexing "welcome there" as far as
the index is concerned.  I started to address this in the 1.4.x Lucene
releases, but the change was backward incompatible, so we reverted it.
Care must also be taken on the query side: PhraseQuery did not handle
term position increments other than 1, but this has been addressed in
the latest codebase (in Subversion).
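
To make the "hole" concrete, here is a small sketch in plain Java (it
does not use the Lucene API; the class, method, and flag names are
hypothetical).  It assigns positions to the tokens that survive stop
word removal, either closing the gap or keeping it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StopWordPositionSketch {

    /**
     * Maps each surviving token to its position. With keepGaps=false a
     * removed stop word consumes no position; with keepGaps=true it
     * still advances the position counter, leaving a hole.
     * Assumes tokens are distinct (a repeated token keeps its last position).
     */
    public static Map<String, Integer> positions(
            String text, Set<String> stopWords, boolean keepGaps) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int position = -1;
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (stopWords.contains(token)) {
                if (keepGaps) {
                    position++; // leave a hole where the stop word was
                }
                continue;
            }
            position++;
            result.put(token, position);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("all"));
        // Without gaps, both texts look identical to the index.
        System.out.println(positions("welcome there", stop, false));     // {welcome=0, there=1}
        System.out.println(positions("welcome all there", stop, false)); // {welcome=0, there=1}
        // With gaps, "there" is two positions away from "welcome".
        System.out.println(positions("welcome all there", stop, true));  // {welcome=0, there=2}
    }
}
```

Only in the last case can a proximity-sensitive query tell the two
documents apart.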

I built a PositionalStopFilter for the Analysis chapter of "Lucene in
Action" and discussed these details there.  It is available in the
code .zip at http://www.lucenebook.com

     Erik

