lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Multi-node stats within individual nodes (was "Baby steps...")
Date Tue, 09 Mar 2010 07:28:39 GMT
On Mon, Mar 08, 2010 at 02:23:47PM -0500, Michael McCandless wrote:
> For a large index the stats will be stable after re-indexing only a
> few more docs.

Well, not if there's been huge churn on other nodes in the interim.

> No... the stat is avg tf within the doc.  

Don't you need the *total* field length -- not just the average tf -- for the
docXfield in question to perform length normalization?  

Or is average term frequency within the docXfield a BM25-specific precursor
that you are using as an example stat?

> So if I index this doc:
> 
>   a a a a b b b c c d
> 
> The avg(tf) = average(4 3 2 1) = 2.5.
> 
> So we'd store 2.5 for that docXfield in a fixed-width dense postings
> list (like column stride fields -- every doc has a value).
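
As a minimal sketch of the arithmetic in the quoted example, the snippet below
computes both the average tf and the total field length for that docXfield.
The function name `field_stats` is hypothetical and is not part of any Lucene
or Lucy API.

```python
from collections import Counter

def field_stats(tokens):
    """Return (total_length, unique_terms, avg_tf) for one docXfield."""
    counts = Counter(tokens)              # term -> term frequency
    total_length = sum(counts.values())   # number of tokens in the field
    unique_terms = len(counts)
    avg_tf = total_length / unique_terms if unique_terms else 0.0
    return total_length, unique_terms, avg_tf

total_length, unique_terms, avg_tf = field_stats("a a a a b b b c c d".split())
print(total_length, unique_terms, avg_tf)  # 10 4 2.5
```

Note that avg(tf) equals the total field length divided by the number of
unique terms, so the average alone does not recover the length.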

Like column-stride fields, but also analogous to the current "norms" -- only
with 4x the space requirements (a 4-byte float per docXfield instead of a
1-byte norm).  That is, unless you compress that float down to a byte, as is
currently done with the norm (3-bit mantissa, 5-bit exponent).
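
To make the float-to-byte idea concrete, here is a rough sketch of packing a
positive float into one byte with 5 exponent bits and 3 mantissa bits.  It is
an illustration only and is not Lucene's actual encoder: the exponent bias of
15, truncation instead of rounding, reserving byte 0 for zero, and the names
`encode` and `decode` are all assumptions.

```python
import struct

MANTISSA_BITS = 3
EXP_BIAS = 15      # assumed bias for the 5-bit exponent
MAX_EXP = 31       # largest value a 5-bit exponent can hold

def encode(f):
    """Pack a float into one byte: 5 exponent bits, 3 mantissa bits."""
    if f <= 0:
        return 0                                   # byte 0 is reserved for zero
    bits = struct.unpack(">I", struct.pack(">f", f))[0]
    exp = ((bits >> 23) & 0xFF) - 127              # unbiased float32 exponent
    mant = (bits >> (23 - MANTISSA_BITS)) & 0x7    # top 3 mantissa bits, truncated
    e = exp + EXP_BIAS
    if e < 0:
        return 1                                   # underflow: smallest positive code
    if e > MAX_EXP:
        return 255                                 # overflow: largest code
    return max((e << MANTISSA_BITS) | mant, 1)

def decode(b):
    """Inverse of encode, up to the precision lost by truncation."""
    if b == 0:
        return 0.0
    e, mant = b >> MANTISSA_BITS, b & 0x7
    return (1 + mant / 8) * 2.0 ** (e - EXP_BIAS)

print(struct.calcsize(">f"))         # 4 bytes for the raw float, vs. 1 byte encoded
print(encode(2.5), decode(encode(2.5)))
print(decode(encode(2.6)))           # nearby values collapse to the same byte
```

With only 3 mantissa bits there are 8 representable values per power of two,
so 2.5 and 2.6 land on the same byte in this sketch.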

The generation of a "norm" byte involves some pretty intense lossy
data-reduction.  If you're going to store the pre-data-reduction raw
materials, you're going to incur a space penalty unless you can eke out
similar savings somewhere.

The coarse quantization is justified because we only care about big
differences at search-time.  If two documents are judged as reasonably close
to each other in relevance, the order in which they rank isn't important.
It's only when docs are judged as far apart in relevance that their relative
rank order matters.

I don't know that compressing the raw materials is going to work as well as
compressing the final product.  Early quantization errors get compounded when
used in later calculations.
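
One hand-picked case, offered as a sketch rather than a proof: quantize two
inputs to 3 mantissa bits and then multiply them, versus multiplying first and
quantizing the result.  The `quantize` helper, the truncating scheme, and the
two-factor product are all hypothetical stand-ins for a real scoring formula.

```python
import math

def quantize(x, mantissa_bits=3):
    """Truncate a positive float to `mantissa_bits` explicit mantissa bits."""
    if x <= 0:
        return 0.0
    m, e = math.frexp(x)                 # x == m * 2**e with 0.5 <= m < 1
    steps = 1 << mantissa_bits
    frac = math.floor((2 * m - 1) * steps) / steps
    return (1 + frac) * 2.0 ** (e - 1)

a = b = 1.24
exact = a * b
early = quantize(a) * quantize(b)        # compress the raw materials first
late = quantize(a * b)                   # compress only the final product

print(early, late)
print(abs(early - exact) > abs(late - exact))
```

In this case each input loses a little to truncation and the product inherits
both losses, while quantizing once at the end loses precision only once.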

BTW, I think we should refer to these bytes as "boost bytes" rather than
"norms".  Their purpose is not simply to convey length normalization; they
also include document boost and field boost.  And the length normalization
multiplier is a kind of boost... so "boost byte" has everything covered, and
avoids the overloading of the term "norm".

Marvin Humphrey




