lucene-java-user mailing list archives

From "Andrew Hudson" <andrewhud...@gmail.com>
Subject Re: short documents = help me tweak Similarity??
Date Thu, 05 Apr 2007 20:03:35 GMT
The problem arises when your float value is encoded into the 8-bit
field norm: the norms for field lengths 3 and 4 both map to the same
byte value.  Call Similarity.encodeNorm on the values you calculate for
the different numbers of terms and make sure they return different byte
values.
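
To make that check concrete, here is a minimal sketch in plain Java. It does not call Lucene: the encodeNorm/decodeNorm methods below are stand-ins written from my understanding of the 3-bit-mantissa, 5-bit-exponent byte encoding that Similarity uses, so treat them as an assumption and run the same loop against Similarity.encodeNorm in your own Lucene version. The class name NormPrecisionDemo is made up for illustration.

```java
class NormPrecisionDemo {
    // Offset of the smallest representable exponent (assumed: zero exponent 15, 3 mantissa bits).
    private static final int FZERO = (63 - 15) << 3;

    // Stand-in for Similarity.encodeNorm (an assumption, not Lucene's own code):
    // keeps only the top bits of the float and drops the rest.
    static byte encodeNorm(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= FZERO) {
            // Zero and negative values map to 0; tiny positive values map to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= FZERO + 0x100) {
            return -1; // overflow: largest encodable value
        }
        return (byte) (smallfloat - FZERO);
    }

    // Stand-in for Similarity.decodeNorm: the inverse of the mapping above.
    static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // The default length norm discussed in this thread: 1/sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Print the encoded byte for each short field length so collisions are visible.
        for (int numTerms = 1; numTerms <= 10; numTerms++) {
            float norm = lengthNorm(numTerms);
            byte encoded = encodeNorm(norm);
            System.out.println(numTerms + " terms: norm=" + norm
                    + " byte=" + encoded + " decoded=" + decodeNorm(encoded));
        }
    }
}
```

With this stand-in, the norms for 3 and 4 terms land on the same byte, while 2 and 3 terms do not. Any lengthNorm you write for short fields has to return values that survive this encoding as distinct bytes.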

Andrew

On 4/5/07, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> As far as I know, this is the case where you want your custom Similarity that knows how to deal with a small number of terms.
>
>   public float lengthNorm(String fieldName, int numTerms) {
>     if (numTerms < N) {
>       // return something smart (N is a threshold you choose)
>     }
>     // default: 1/sqrt(numTerms)
>     return (float)(1.0 / Math.sqrt(numTerms));
>   }
>
> I think the rest of what you said is correct.  Look at this piece of Similarity javadoc:
>
>
>  *      However the resulted <i>norm</i> value is {@link #encodeNorm(float) encoded} as a single byte
>  *      before being stored.
>  *      At search time, the norm byte value is read from the index
>  *      {@link org.apache.lucene.store.Directory directory} and
>  *      {@link #decodeNorm(byte) decoded} back to a float <i>norm</i> value.
>  *      This encoding/decoding, while reducing index size, comes with the price of
>  *      precision loss - it is not guaranteed that decode(encode(x)) = x.
>  *      For instance, decode(encode(0.89)) = 0.75.
>  *      Also notice that search time is too late to modify this <i>norm</i> part of scoring, e.g. by
>  *      using a different {@link Similarity} for search.
>
>
> If you come up with a more generic lengthNorm that deals with "overly short" documents/fields well, I'd love to know! :)
>
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>
> ----- Original Message ----
> From: John Kleven <jkleven@vinquire.com>
> To: java-user@lucene.apache.org
> Sent: Thursday, April 5, 2007 1:45:34 PM
> Subject: Re: short documents = help me tweak Similarity??
>
> Sorry to re-post -- is this the correct forum for questions like this?  I
> think that writing a new encode/decode operation should help alleviate my
> problem, but thought that this must be a fairly widespread issue for people
> using Lucene for "non-web-page" searches (i.e., shorter documents).
>
> Thanks again,
> John
>
> On 4/2/07, John Kleven <johnkleven@gmail.com> wrote:
> >
> > My documents are cars...
> > i.e.,
> > Nissan Altima Sports Package
> > Nissan Altima Standard
> >
> > The problem I have is that when I search "Nissan Altima", I want to get the 2nd
> > hit back first, i.e. "Nissan Altima Standard", because it is shorter.
> > However, this doesn't happen.  They are both scored exactly the same.
> >
> > I know that the lengthNorm in Similarity is using 1/sqrt(numTerms), and
> > you would think that would be enough to make sure the order is correct.
> > However, it is not, and I assume this is because the encode/decode
> > functions that pack this value into a single byte do not have the
> > granularity to represent differences between numbers like 1/sqrt(3) vs
> > 1/sqrt(4)?
> >
> > Is the suggested approach here to re-write the encode/decode operations,
> > or is there any easier way?
> >
> > Thanks kindly -
> > John
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
