Date: Thu, 5 Apr 2007 23:09:28 -0700 (PDT)
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
Subject: Re: short documents = help me tweak Similarity??
To: java-user@lucene.apache.org
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Message-ID: <850041.71892.qm@web50304.mail.re2.yahoo.com>

John,

Look at the coord method in Similarity.  That may help you solve the ....  e.g., "Nissan Altima Sports Package" will be
the #1 hit even though there was an exact document matching every term.

Otis

----- Original Message ----
From: John Kleven <jkleven@vinquire.com>
To: java-user@lucene.apache.org
Sent: Friday, April 6, 2007 12:13:30 AM
Subject: Re: short documents = help me tweak Similarity??

Thank you kindly for the responses.

This was the solution that I dreamed up initially as well: overriding
lengthNorm and making the returned values for small numTerms values (e.g. 3
and 4) more distinct.

So I did that in multiple ways, and I ran into a different problem.  If
lengthNorm returns something like this...
numTerms    return from lengthNorm()
1           2.4
2           2.0
3           1.6
4           1.2
5           0.8
>5          1/sqrt(numTerms)

...you wind up with a weird situation.  What happens is that for a document
like "Nissan Altima SE Sports Package Sunroof CD 2003" -- imagine someone
searching for that exact document with the exact search string (i.e., they
literally search "Nissan Altima SE Sports Package Sunroof CD 2003") then the
*wrong* hits bubble to the top, e.g., "Nissan Altima Sports Package" will be
the #1 hit even though there was an exact document matching every term.
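
To make the arithmetic behind this concrete, here is a minimal sketch of a toy score. It is not Lucene's real scoring formula: it assumes tf = 1 and idf = 1 for every term, ignores queryNorm and the byte encoding of norms, and the class and method names (CoordSketch, toyScore) are made up for illustration. The coord factor is modeled as matched terms divided by query terms, which is what DefaultSimilarity.coord returns.

```java
final class CoordSketch {
    // Toy score: coord * (number of matched terms * norm).
    // Assumes tf = 1 and idf = 1 for every term, which a real index will not satisfy.
    static double toyScore(int matched, int queryTerms, double norm, boolean useCoord) {
        double coord = useCoord ? (double) matched / queryTerms : 1.0;
        return coord * matched * norm;
    }

    public static void main(String[] args) {
        int queryTerms = 8; // "Nissan Altima SE Sports Package Sunroof CD 2003"

        // Full document: 8 terms, norm from the ">5" row of the table.
        double fullNorm = 1.0 / Math.sqrt(8);
        // Short document "Nissan Altima Sports Package": 4 terms, norm 1.2 from the table.
        double shortNorm = 1.2;

        System.out.println("no coord:   full=" + toyScore(8, queryTerms, fullNorm, false)
                + " short=" + toyScore(4, queryTerms, shortNorm, false));
        System.out.println("with coord: full=" + toyScore(8, queryTerms, fullNorm, true)
                + " short=" + toyScore(4, queryTerms, shortNorm, true));
    }
}
```

In this toy model the short document wins without coord (about 4.8 against 2.83) and loses only narrowly with it (2.4 against 2.83), so unequal idf values could plausibly tip the ranking back toward the short document. That is a guess at the mechanism, not a diagnosis of the actual index.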

So then I tried a bunch of different lengthNorm implementations (because
obviously the example lengthNorm above is drastic) ... some drastic (a la the
example above) and some less drastic but still hopefully allowing a big
enough lengthNorm value so that encode/decode would still see a difference
in search length.

And I've yet to be able to find a happy medium that works.  I know that I'm
doing something stupid, or not understanding the scoring algorithm enough
(i.e., there's another part of the Similarity class that should take into
account when you have lots of search terms and corresponding matches) but I
just couldn't find a way to make it happen.

This is why I was wondering if I could just override encode/decode in both my
indexer and searcher, and thereby just avoid this whole thing entirely.  But
apparently that isn't a solution either?

Also, I don't understand why the encode/decode functions have a range of
7x10^9 to 2x10^-9, when it seems to me the most common values are (boosts
set to 1.0) something between 1.0 and 0.  When would somebody have a monster
huge value like 7x10^9?  Even with a huge index time boost of 20.0 or
something, why would the encode/decode need a range as huge as the current
implementation?
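
One way to look at the precision/range trade-off is to run candidate norm values through a one-byte float encoding and see which ones collide. The sketch below is standalone and modeled on Lucene's SmallFloat.floatToByte315 / byte315ToFloat; whether the release in use encodes norms exactly this way is an assumption, and the class name NormEncodingSketch is made up. With Lucene on the classpath, calling Similarity.encodeNorm directly (as suggested in the quoted reply below) is the authoritative check.

```java
final class NormEncodingSketch {
    // Pack a float into one byte by keeping only its top exponent/mantissa bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Zero or negative maps to 0; positive underflow maps to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow maps to the largest code
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    // Expand the byte back to a float; the dropped mantissa bits come back as zeros.
    static float decode(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    static float roundTrip(float f) {
        return decode(encode(f));
    }

    public static void main(String[] args) {
        // The default-style norm 1/sqrt(numTerms) for short documents.
        for (int n = 1; n <= 8; n++) {
            float norm = (float) (1.0 / Math.sqrt(n));
            System.out.println(n + " terms: " + norm + " -> " + roundTrip(norm));
        }
        // The example table from earlier in this message.
        float[] table = {2.4f, 2.0f, 1.6f, 1.2f, 0.8f};
        for (float v : table) {
            System.out.println(v + " -> " + roundTrip(v));
        }
        System.out.println("largest value: " + decode((byte) -1));
    }
}
```

In this sketch only four values are representable per power of two, so 1/sqrt(3) and 1/sqrt(4) both come back as 0.5. The example table does not fully escape this either: 2.4 and 2.0 both come back as 2.0, while 1.6, 1.2 and 0.8 come back as 1.5, 1.0 and 0.75. The largest code decodes to roughly 7.5x10^9, so the byte spends most of its bits on range rather than on resolution near 1.0.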

I know I am missing something.  Thanks so much for the responses, I've been
playing with this for so long I knew it was time to post a message and get
to the bottom of this.  Thank you again -
J


On 4/5/07, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>
>
> : The problem comes when your float value is encoded into that 8 bit
> : field norm, the 3 length and 4 length both become the same 8 bit
> : value.  Call Similarity.encodeNorm on the values you calculate for the
> : different numbers of terms and make sure they return different byte
> : values.
>
> bingo.  You can't change encodeNorm and decodeNorm (because they are
> static -- they need to work at query time regardless of what Similarity
> was used at index time) but you can change the function of your length
> norm to make small deviations in short lengths significant enough to
> register even with the encoding.
>
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

