From: Erik Hatcher
Subject: Re: Reordering search results
Date: Mon, 3 Oct 2005 06:02:14 -0400
To: java-dev@lucene.apache.org

On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:

>> 1- Words in Document that are more close to original search terms
>> have a larger Score. For example, if I was searching for "wellcome",
>> Document("wellcome") must be better than Document("welcome")
>
> I'm just "thinking out loud" here, but some ideas that come to mind
> are: Index both the original text (with spelling errors), and the
> spelling-corrected text. When you search, search on both the
> corrected text, and in a non-required query clause search on the
> uncorrected text, maybe boosted down a bit. This way, if the spelling
> was correct, it will match both the original term and the corrected
> term (since they're the same), but a document with a misspelling would
> match only the corrected term. You'll have to experiment with boosts
> and relevance/rankings here.
>
> Another idea is, if you know the number of misspellings made at
> indexing time (it seems like you do), then boost documents based on
> the number of spelling errors -- higher boost factor for fewer errors.

Another tip: score is based on term frequency, so when tokenizing
correctly spelled words, emit the word more than once to weight
scoring toward it.

>> 2- Documents that have search terms close to each other, have a
>> larger Score. For example, if I was searching for "welcome there",
>> Document("welcome there") must be better than Document("welcome all
>> there"). Note that "all" is a stop word in my implementation.
>
> PhraseQuery with a high slop factor (MAX_INT works) scores higher for
> terms that are closer together.
> You can construct the PhraseQuery
> yourself (programmatically), or QueryParser takes it as:
>
> "welcome there"~99999
>
> (with the quotes) 99999 is the slop factor, which means to accept
> documents where "welcome" is within 99999 positions from "there".

The issue is that "all" is a stop word, though. The StopFilter does
not leave a hole when stop words are removed, so indexing "welcome all
there" is exactly the same as indexing "welcome there" as far as the
index is concerned. I started to address this situation in the 1.4.x
Lucene releases, but it introduced a backward-compatibility issue, so
we reverted.

Care must be taken on the Query side of things: PhraseQuery did not
deal with anything but term position increments of 1, but this has
been addressed in the latest codebase (in Subversion).

I built a PositionalStopFilter for the Analysis chapter of "Lucene in
Action" and discussed these details there. It is available in the
code .zip at http://www.lucenebook.com

    Erik
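
The point about stop words and positions can be illustrated with a small, self-contained Java sketch. It does not use Lucene at all: the class StopPositionsDemo and its methods positions and gap are hypothetical names made up for this illustration, and the toy whitespace tokenizer assumes each term occurs only once in the text. It only shows why, without holes, "welcome all there" and "welcome there" look identical to a position-based query, and how keeping a hole for the removed stop word would preserve the distance.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StopPositionsDemo {

    /**
     * Assigns a position to every token that is not a stop word.
     * keepHoles == false: a removed stop word does not advance the position.
     * keepHoles == true:  a removed stop word still consumes one position.
     * Toy assumption: each term appears at most once in the text.
     */
    public static Map<String, Integer> positions(String text, Set<String> stopWords,
                                                 boolean keepHoles) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int position = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (stopWords.contains(token)) {
                if (keepHoles) {
                    position++; // leave a hole where the stop word was
                }
                continue;
            }
            result.put(token, position);
            position++;
        }
        return result;
    }

    /** Number of positions lying between two terms (0 means adjacent). */
    public static int gap(Map<String, Integer> positions, String first, String second) {
        return positions.get(second) - positions.get(first) - 1;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("all");

        // Without holes, both documents produce the same positions.
        System.out.println(positions("welcome there", stop, false));
        System.out.println(positions("welcome all there", stop, false));

        // With holes, the removed stop word still separates the two terms.
        Map<String, Integer> withHoles = positions("welcome all there", stop, true);
        System.out.println(withHoles);
        System.out.println(gap(withHoles, "welcome", "there"));
    }
}
```

With holes preserved, the two terms are one position apart instead of adjacent, which is the kind of difference a proximity-sensitive query needs in order to rank "welcome there" above "welcome all there".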