From: Otis Gospodnetic
To: java-user@lucene.apache.org
Date: Thu, 5 Apr 2007 11:21:13 -0700 (PDT)
Subject: Re: short documents = help me tweak Similarity??
Message-ID: <925086.96042.qm@web50302.mail.re2.yahoo.com>

As far as I know, this is the case where you want your own custom Similarity that knows how to deal with a small number of terms:

    public float lengthNorm(String fieldName, int numTerms) {
        if (numTerms < N) {
            // N is a cutoff you pick; return something smart for short fields here
        }
        // default behaviour: 1/sqrt(numTerms)
        return (float) (1.0 / Math.sqrt(numTerms));
    }

I think the rest of what you said is correct. Look at this piece of the Similarity javadoc:

 * However the resulted norm value is {@link #encodeNorm(float) encoded} as a single byte
 * before being stored.
 * At search time, the norm byte value is read from the index
 * {@link org.apache.lucene.store.Directory directory} and
 * {@link #decodeNorm(byte) decoded} back to a float norm value.
 * This encoding/decoding, while reducing index size, comes with the price of
 * precision loss - it is not guaranteed that decode(encode(x)) = x.
 * For instance, decode(encode(0.89)) = 0.75.
 * Also notice that search time is too late to modify this norm part of scoring, e.g. by
 * using a different {@link Similarity} for search.

If you come up with a more generic lengthNorm that deals with "overly short" documents/fields well, I'd love to know! :)

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message ----
From: John Kleven
To: java-user@lucene.apache.org
Sent: Thursday, April 5, 2007 1:45:34 PM
Subject: Re: short documents = help me tweak Similarity??

Sorry to re-post -- is this the correct forum for questions like this?

I think that writing a new encode/decode operation should help alleviate my problem, but I thought that this must be a fairly widespread issue for people using Lucene for "non-web-page" searches (i.e., shorter documents).

Thanks again,
John

On 4/2/07, John Kleven wrote:
>
> My documents are cars...
> i.e.,
> Nissan Altima Sports Package
> Nissan Altima Standard
>
> The problem I have is when I search "Nissan Altima", I want to get the 2nd
> hit back first, i.e. "Nissan Altima Standard", because it is shorter.
> However, this doesn't happen. They are both scored the exact same.
>
> I know that the lengthNorm in Similarity is using 1/sqrt(numTerms), and
> you would think that would be enough to make sure the order is correct.
> However, it is not, and I assume this is because the encode/decode
> functions that pack this value into a single byte do not have the
> granularity to represent differences between numbers like 1/sqrt(3) vs
> 1/sqrt(4)??
>
> Is the suggested approach here to re-write the encode/decode operations,
> or is there any easier way?
>
> Thanks kindly -
> John
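
The granularity question raised in this thread (1/sqrt(3) vs 1/sqrt(4)) can be checked with a small standalone sketch. It does not call Lucene. It is a hand-written approximation of a one-byte float encoding with a 5-bit exponent and a 3-bit mantissa field, modeled on how Similarity is described as packing norms, so treat the bit layout as an assumption. The names NormPrecisionSketch, encode, decode and roundTrip are made up for illustration.

```java
class NormPrecisionSketch {

    // Pack a float into one byte: 5 exponent bits, 3 mantissa bits.
    // Assumption: this mirrors the layout of Lucene's norm byte; it is not Lucene code.
    static byte encode(float f) {
        if (f <= 0.0f) {
            return 0; // zero and negative values collapse to 0
        }
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;          // top 3 of the low 24 bits
        int exponent = ((bits >> 24) & 0x7f) - 63 + 15;  // re-bias into 0..31
        if (exponent > 31) { // too large: clamp to the biggest representable value
            exponent = 31;
            mantissa = 7;
        }
        if (exponent < 0) {  // too small: clamp to the smallest positive value
            exponent = 0;
            mantissa = 1;
        }
        return (byte) ((exponent << 3) | mantissa);
    }

    // Expand the byte back into a float; the discarded low bits come back as zeros.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    // The norm a field with numTerms terms would carry after being stored and read back.
    static float roundTrip(int numTerms) {
        float norm = (float) (1.0 / Math.sqrt(numTerms));
        return decode(encode(norm));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 6; n++) {
            float exact = (float) (1.0 / Math.sqrt(n));
            System.out.println(n + " terms: exact=" + exact + " stored=" + roundTrip(n));
        }
    }
}
```

If this sketch matches the real encoding, a 3-term field and a 4-term field both come back with the same stored norm, which would explain why "Nissan Altima Standard" and "Nissan Altima Sports Package" tie. A custom lengthNorm for short fields would then need to return values far enough apart to survive the byte encoding, and the index would have to be rebuilt, since norms are written at index time.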