Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 8257 invoked from network); 6 Apr 2007 06:10:00 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 6 Apr 2007 06:10:00 -0000 Received: (qmail 74457 invoked by uid 500); 6 Apr 2007 06:09:57 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 74434 invoked by uid 500); 6 Apr 2007 06:09:57 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 74423 invoked by uid 99); 6 Apr 2007 06:09:57 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 05 Apr 2007 23:09:57 -0700 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: pass (herse.apache.org: local policy) Received: from [206.190.38.58] (HELO web50304.mail.re2.yahoo.com) (206.190.38.58) by apache.org (qpsmtpd/0.29) with SMTP; Thu, 05 Apr 2007 23:09:49 -0700 Received: (qmail 71962 invoked by uid 60001); 6 Apr 2007 06:09:28 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=X-YMail-OSG:Received:X-Mailer:Date:From:Subject:To:MIME-Version:Content-Type:Message-ID; b=C7CO0ZDEn3ejYDg/lDm2jvb6QWx39rRHPEBL1s6QGa2e+hZkWaIHcLFLVlfWNpVuNC7R1BzR+edEra0y500XttOiycaaZnbdCVZtUOVq4ZoxYbNwABqEXh2c26CcRf6UlPRUqoiuTDXEsiJ9HXCA/aDomf7aLZFpHzf4IR+L8Z0=; X-YMail-OSG: y0wAFWQVM1m605x6jCQGKaio98CN7S1t5_mVsXwkvJC3ZQd2H34lnA7W0ky9QR6elay0fkFzL4Fmcj0naoWnDTMvGg-- Received: from [63.192.54.62] by web50304.mail.re2.yahoo.com via HTTP; Thu, 05 Apr 2007 23:09:28 PDT X-Mailer: YahooMailRC/476 YahooMailWebService/0.7.41.8 Date: Thu, 5 Apr 2007 23:09:28 -0700 (PDT) From: Otis Gospodnetic Subject: Re: short documents = help me tweak Similarity?? To: java-user@lucene.apache.org MIME-Version: 1.0 Content-Type: text/plain; charset=ascii Message-ID: <850041.71892.qm@web50304.mail.re2.yahoo.com> X-Virus-Checked: Checked by ClamAV on apache.org John, Look at coord Similarity method. That may help you solve the .... e.g., "Nissan Altima Sports Package" will be the #1 hit even though there was an exact document matching every term. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share ----- Original Message ---- From: John Kleven To: java-user@lucene.apache.org Sent: Friday, April 6, 2007 12:13:30 AM Subject: Re: short documents = help me tweak Similarity?? Thank you kindly for the responses. This was the solution that I dreamed up initially as well (overriding lengthNorm) and making the returned values for small numTerms values (e.g. 3 and 4) more discrete. So I did that in multiple ways, and I ran into a different problem. If lengthNorm returns something like this... numTerms return from lengthNorm() 1 2.4 2 2.0 3 1.6 4 1.2 5 0.8 >5 1/sqrt(numTerms) ...you wind up with a weird situation. What happens is that for a document like "Nissan Altima SE Sports Package Sunroof CD 2003" -- imagine someone searching for that exact document with the exact search string (i.e, they literally search "Nissan Altima SE Sports Package Sunroof CD 2003") then the *wrong* hits bubble to the top, e.g., "Nissan Altima Sports Package" will be the #1 hit even though there was an exact document matching every term. So then i tried a bunch of different lengthNorm implementations (because obviously the example lengthNorm above is drastic) ... some drastic (ala the example above) and some less drastic but still hopefully allowing a big enough lengthNorm value so that encode/decode would still see a difference in search length. And i've yet to be able to find a happy medium that works. I know that i'm doing something stupid, or not understanding the scoring algorithm enough ( i.e., there's another part of the similarity class that should take into account when you have lots of search terms and corresponding matches) but I just couldn't find a way to make it happen. This is why i was wondering if could just overload encode/decode in both my indexer and searcher, and thereby just avoid this whole thing entirely. But apparently that isn't a solution either? Also, i don't understand why the encode/decode functions have a range of 7x10^9 to 2x10^-9, when it seems to me the most common values are (boosts set to 1.0) something between 1.0 and 0. When would somebody have a monster huge value like 7x10^9? Even with a huge index time boost of 20.0 or something, why would the encode/decode need a range as huge as the current implementation? I know I am missing something. Thanks so much for the reponses, I've been playing with this for so long I knew it was time to post a message and get to the bottom of this. Thank u again - J On 4/5/07, Chris Hostetter wrote: > > > : The problem comes when your float value is encoded into that 8 bit > : field norm, the 3 length and 4 length both become the same 8 bit > : value. Call Similarity.encodeNorm on the values you calculate for the > : different numbers of terms and make sure they return different byte > : values. > > bingo. You can't change encodeNorm and decodeNorm (because they are > static -- they need to work at query time regardless of what Similarity > was used at index time) but you can change the function of your length > norm to make small deviations in short lengths significant enough to > register even with the encoding. > > > > > -Hoss > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org > For additional commands, e-mail: java-user-help@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org