lucene-dev mailing list archives

From Nick Smith <nick.sm...@techop.ch>
Subject Re: Issue with Similarity and negative numbers
Date Thu, 11 Sep 2003 09:08:18 GMT

>Date: Wed, 10 Sep 2003 14:11:30 -0700
>From: Doug Cutting <cutting@lucene.com>
>Subject: Re: Issue with Similarity and negative numbers
>
>-1
>
>This would be an incompatible change that could break lots of folks. 
>Also, the range of values that you represent in your one-byte float 
>format is less useful to most Lucene applications.  Negative values are 
>rarely used, and normalizing values to be between 0 and 1 is not always 
>easy.

I had taken care to make sure that the change was *compatible*.  :-(

What about the change would break lots of folks?  My rationale was that
if the mapping of positive bytes to positive floats, and vice versa, is
unchanged, the only way to get negative bytes into the index would be
to use a negative float as a field or document boost.
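
To make the compatibility argument concrete, here is a minimal sketch of
a sign-symmetric one-byte float codec. It is neither the patch itself
nor Lucene's real Similarity encoding: the class name SignedByteCodec
and the bit layout (4 exponent bits, 3 mantissa bits for the positive
codes 1..127) are assumptions chosen only for illustration. The point it
shows is that negative bytes can mirror the positive codes without
changing what any positive byte decodes to.

```java
/**
 * Toy sign-symmetric one-byte float codec. NOT Lucene's Similarity
 * encoding and NOT the patch discussed here; the layout is hypothetical.
 */
final class SignedByteCodec {
    // Hypothetical positive codes 1..127: 4 exponent bits, 3 mantissa bits.
    private static final float[] POSITIVE = new float[128];
    static {
        for (int code = 1; code < 128; code++) {
            int exponent = code >> 3;   // 0..15
            int mantissa = code & 7;    // 0..7
            POSITIVE[code] =
                (1.0f + mantissa / 8.0f) * (float) Math.pow(2.0, exponent - 7);
        }
        // POSITIVE[0] stays 0.0f: byte 0 is reserved for zero.
    }

    /** Largest code whose value is <= f; values below code 1 round up to 1. */
    private static int encodePositive(float f) {
        int code = 1;
        while (code < 127 && POSITIVE[code + 1] <= f) {
            code++;
        }
        return code;
    }

    /** Positive floats map to 1..127, negative floats to -1..-127. */
    static byte floatToByte(float f) {
        if (f == 0.0f || Float.isNaN(f)) {
            return 0;
        }
        return (byte) (f > 0 ? encodePositive(f) : -encodePositive(-f));
    }

    /** Mirror image of floatToByte; positive bytes decode as before. */
    static float byteToFloat(byte b) {
        if (b == Byte.MIN_VALUE) {
            b = -127;   // -128 is never produced by floatToByte; clamp it
        }
        return b >= 0 ? POSITIVE[b] : -POSITIVE[-b];
    }
}
```

In this sketch the decoded value grows strictly with the signed byte
value over the whole range -127..127, so an index that contains only
non-negative bytes reads back exactly as it did before.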

>Can you please describe more about what you're trying to achieve?  There 
>are lots of other ways of efficiently implementing date-sorted search 
>results.  For example, you can add the documents to the index in 
>chronological order, then use a HitFilter which collects the documents 
>with the highest document id.  That is very efficient and requires no 
>changes to Lucene.

I have a highly dynamic index of news headlines where the incoming
headlines are often not in chronological order.  To make things worse,
changes must be made to headlines after indexing without affecting
their chronological order.

I override the default Similarity instance to disable field
normalization and set the date-sorting 'hint' using
Document.setBoost(float).

Using the score I can also implement forward / back paging, as
the score is persistent and the document ids are not. I do this
by using an org.apache.lucene.search.Filter, accessing the
scores through IndexReader.norms(String field) and only setting
the bit in the BitSet when the score is in the required range.
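
The range check at the heart of such a filter can be sketched without
any Lucene dependency. The class NormRangeFilter and the method
bitsInRange below are hypothetical names; the sketch assumes the norms
arrive as one byte per document id and that a larger signed byte decodes
to a larger score, which is what a sign-symmetric encoding would give.

```java
/** Hypothetical helper: selects documents whose stored byte is in a range. */
final class NormRangeFilter {
    /**
     * Returns a BitSet with bit i set when norms[i] lies in [low, high].
     * Bytes are compared as signed values (assumption: the encoding is
     * monotonic in the signed byte value).
     */
    static java.util.BitSet bitsInRange(byte[] norms, byte low, byte high) {
        java.util.BitSet bits = new java.util.BitSet(norms.length);
        for (int doc = 0; doc < norms.length; doc++) {
            if (norms[doc] >= low && norms[doc] <= high) {
                bits.set(doc);   // document id 'doc' falls on the requested page
            }
        }
        return bits;
    }
}
```

Paging forward or back then amounts to calling bitsInRange with the byte
range that follows or precedes the one currently displayed.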

A previous solution used the HitFilter and document id approach
that you suggested. Alas, it did not work 100% correctly.

Is there a FAQ entry about common date-sorting methods?

>Cheers,
>
>Doug
>

Many thanks for a great product!

Nick

>Nick Smith wrote:
>> Hi Luceners!
>> 
>> I am misusing the document score for date sorting (I display news
>> headlines in a chronological list).
>> 
>> As the document score is ultimately encoded as a byte the maximum
>> possible number of values is 256 minus the special value of 0
>> (document not found).
>> 
>> In the current implementation, all negative float values get
>> rounded up to zero by Similarity.floatToByte(), and the method
>> Similarity.byteToFloat() returns, only for byte values in the
>> range 1 to 127, a value greater than the decoded value of the
>> next lower byte.
>> 
>> i.e. 
>> Similarity.byteToFloat(byteVal+1) > Similarity.byteToFloat(byteVal)
>> 
>> For my application having 255 possible scores from searches was better
>> than 127 so....
>> 
>> I have patched the Similarity class to encode negative floats into
>> the negative byte values and to decode the negative byte values back
>> into negative floats.
>> 
>> The encoding of the positive values are unchanged by this patch.
>> 
>> Could this version please be checked into CVS by someone with commit
>> rights?  Or is there a more formal procedure for submitting patches,
>> say via the Bugzilla?
>> 
>> Many Thanks,
>> 
>> Nick Smith
>
>

