lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Multi-node stats within individual nodes (was "Baby steps...")
Date Tue, 09 Mar 2010 19:11:07 GMT
On Tue, Mar 09, 2010 at 01:04:19PM -0500, Michael McCandless wrote:
> BM25 needs the field length in tokens.  lnu.ltc needs avg(tf).  These
> 2 stats seem to be the "common" ones (according to Robert).  So I want to
> start with them.

OK, interesting.

> > I don't know that compressing the raw materials is going to work as well as
> > compressing the final product.  Early quantization errors get compounded when
> > used in later calculations.
> 
> I would not compress for starters...

How about lossless compression, then?  Do you need random access into this
specialized posting list?  For the use cases you've described so far I don't
think so, since you're just iterating it top to bottom on segment open.

You could store the total length of the field in tokens and the number of
unique terms as integers, compressing with vbyte, PFOR or whatever... then
divide at search time to get average term frequency.  That way, you also avoid
committing to a float encoding, which I don't think Lucene has standardized
yet.
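
A minimal sketch of the idea above, not Lucene's actual storage format: each
document's field length and unique-term count are written as vbyte-encoded
integers, read back in a single top-to-bottom pass, and divided at read time
to get the average term frequency. The class and method names
(FieldStatsSketch, writeVInt, readVInt, encode, averageTermFreqs) are
hypothetical, and treating a document with no terms as avg(tf) = 0 is an
assumption made for illustration.

```java
import java.io.ByteArrayOutputStream;

public class FieldStatsSketch {

    // Write a non-negative int using 7 data bits per byte; the high bit
    // of a byte is set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Read one vbyte-encoded int starting at pos[0], advancing pos[0].
    static int readVInt(byte[] buf, int[] pos) {
        int b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Store, per document, the field length in tokens followed by the
    // number of unique terms. Both stay integers, so no float encoding
    // is committed to on disk.
    static byte[] encode(int[] fieldLengths, int[] uniqueTerms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int doc = 0; doc < fieldLengths.length; doc++) {
            writeVInt(out, fieldLengths[doc]);
            writeVInt(out, uniqueTerms[doc]);
        }
        return out.toByteArray();
    }

    // Iterate the stream sequentially (no random access needed) and
    // derive avg(tf) = fieldLength / uniqueTerms for each document.
    static float[] averageTermFreqs(byte[] buf, int numDocs) {
        float[] avg = new float[numDocs];
        int[] pos = {0};
        for (int doc = 0; doc < numDocs; doc++) {
            int length = readVInt(buf, pos);
            int unique = readVInt(buf, pos);
            // Assumption: an empty field gets an average of 0.
            avg[doc] = unique == 0 ? 0f : (float) length / unique;
        }
        return avg;
    }
}
```

Because the division happens only after decoding, the stored values remain
exact and any rounding is confined to the final float.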

Marvin Humphrey



