Subject: Re: Baby steps towards making Lucene's scoring more flexible...
Date: Tue, 2 Mar 2010 05:55:44 -0500
From: Michael McCandless
To: java-dev@lucene.apache.org

On Sun, Feb 28, 2010 at 1:38 PM, Marvin Humphrey wrote:

> On Fri, Feb 26, 2010 at 12:50:44PM -0500, Michael McCandless wrote:
>
>> * Store additional per-doc stats in the index, eg in a custom
>>   posting list,
>
> Inline, as in a payload?  Of course that can work, but if the data
> is common over multiple postings, you pay in space to gain locality.
> KinoSearch does this right now -- inlining the norm byte within each
> posting -- but I think it was a dubious tradeoff.

No, not inlined -- stored once, separately, and iterated directly from
there.  I think the stats will be read once, at searcher init time,
and turned into in-RAM arrays (like norms).

> But why do we have to make a choice one way or the other?  So long
> as the decoder has the option of reading from arbitrary streams and
> data structures, we can leave it up to the posting format spec
> whether to inline or not.

Well, for starters I think we should hardwire the stats (computing
only field length & avg term freq).  Later we can make what's stored,
and how it's stored, pluggable.  Baby steps...

>> including length in tokens of the field, avg tf, and boost (boost
>> can be efficiently stored so it's only stored if it differs from
>> the default).
>
> Efficient compression always depends on identifying patterns within
> data.  Classical examples:
>
>  * Audio signals delta encode well because they are continuously
>    variable and adjacent samples are usually close together in
>    magnitude.
>
>  * Line art has a great affinity for run-length encoding because
>    many contiguous pixels have exactly the same color value.
>
> Since boost is optional and a lot of people don't use it, it's true
> that if we break it out away from length normalization it can be
> stored efficiently using something like RLE.
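A toy sketch of that run-length idea for per-doc boosts (a hypothetical encoding, just to illustrate -- not an actual Lucene or KinoSearch format): since most docs keep the default boost of 1.0, long runs of the default collapse to a single (length, value) pair.

```java
import java.util.ArrayList;
import java.util.List;

public class BoostRLE {
    /** Encode per-doc boosts as (runLength, value) pairs. */
    public static List<float[]> encode(float[] boosts) {
        List<float[]> runs = new ArrayList<>();
        int i = 0;
        while (i < boosts.length) {
            int j = i;
            while (j < boosts.length && boosts[j] == boosts[i]) j++;
            runs.add(new float[]{j - i, boosts[i]});  // {run length, boost value}
            i = j;
        }
        return runs;
    }

    /** Expand the runs back into one boost per doc. */
    public static float[] decode(List<float[]> runs, int numDocs) {
        float[] boosts = new float[numDocs];
        int doc = 0;
        for (float[] run : runs) {
            for (int k = 0; k < (int) run[0]; k++) boosts[doc++] = run[1];
        }
        return boosts;
    }

    public static void main(String[] args) {
        // 6 docs, only doc 3 carries a non-default boost:
        float[] boosts = {1f, 1f, 1f, 2.5f, 1f, 1f};
        List<float[]> runs = encode(boosts);
        System.out.println(runs.size());  // 3 runs: [3 x 1.0][1 x 2.5][2 x 1.0]
    }
}
```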
> However, boost isn't the only thing that may be worth compressing:
> single-token and match-only fields have constant length-in-tokens.

Good point.  For starters I think we'll use some simple fixed
encoding.  Users can turn the stats on/off if they don't want to pay
the disk space.

> Surely the posting format specification must be allowed to exploit
> any compression technique, at its own discretion and with insight
> into data patterns unavailable to us from the top level.

Eventually, yes (when we make this pluggable).

>> Do not compute nor store norms in the index.
>
> Heh.  Well, with Lucy, naturally we'd want to do the opposite of
> that: write everything at index time and then mmap at search time.
> Even in Lucene, it seems odd to want to calculate all of those on
> the fly each time you open an index.  It seems to me that this is a
> specialized need of BM25.

The problem is, these scoring models need the avg field length (in
tokens) across the entire index to compute the norms.  Ie, you can't
do that while writing a single segment.  So I think it must be done
during searcher init.  The most we can do is store the aggregates (eg
the sum of all field lengths in this segment) in the SegmentInfo --
this saves one pass on searcher init.

> So how do you support both?  Easy: allow the postings reader an
> initial startup phase where it can cache arbitrary data, typically
> data which is shared across multiple postings.  For instance:
>
>  * Either slurp (Lucene) or mmap (Lucy) a pregenerated norms file.
>  * Accumulate/derive data, e.g. true normalization data for BM25.
>  * Do nothing, because the field is match-only.
>
> You'd want to do this inside some sort of per-field class, which
> would then spawn individual enumerators.  The shared reader holds
> common data; the enumerators reference the common data, but maintain
> state while iterating over postings.
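That shared-reader / per-enumerator split could look roughly like this (all class names here are illustrative stand-ins, not actual Lucene or Lucy APIs): one persistent reader per field holds the common data, and each enumerator it spawns keeps only its own iteration state.

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldDispatch {
    /** Holds data shared across all enumerators for one field. */
    public static class PostingsReader {
        public final String field;
        public PostingsReader(String field) { this.field = field; }
        /** Spawn an enumerator that references this shared reader. */
        public DocsEnum docsEnum() { return new DocsEnum(this); }
    }

    public static class DocsEnum {
        public final PostingsReader shared;  // common, persistent data
        int doc = -1;                        // per-enumerator iteration state
        DocsEnum(PostingsReader shared) { this.shared = shared; }
    }

    /** Dispatch: one shared PostingsReader per field, created lazily. */
    public static class PostingListReader {
        private final Map<String, PostingsReader> perField = new HashMap<>();
        public PostingsReader forField(String field) {
            return perField.computeIfAbsent(field, PostingsReader::new);
        }
    }

    public static void main(String[] args) {
        PostingListReader plr = new PostingListReader();
        DocsEnum a = plr.forField("body").docsEnum();
        DocsEnum b = plr.forField("body").docsEnum();
        // Two enumerators for the same field share one reader instance:
        System.out.println(a.shared == b.shared);  // true
    }
}
```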
>   SegmentReader
>     PostingListReader
>       map of field->PostingsReader pairs  <-- per-field shared data, persistent
>
> Requests for various enumerators (doc id only, doc id + positions,
> etc.) would be dispatched through to the PostingsReader for the
> specified field.

This is roughly how flex does it today for postings (though you gain
access to the postings reader via the fields & terms enums).

But stats are different?  They are like column-stride fields -- ie a
posting list that is usually dense (every doc is included), and is
iterated at the field level (not field + term).  Or, hmmm... maybe
"term" will be the name of the statistic, and maybe we use Attribute
to keep it "generic".  Maybe...

>> Merging would just concatenate these values (removing deleted
>> docs).
>
> The ability to concatenate depends on having either no externally
> referenced data, or having all externally referenced data be
> immutable.
>
> Theoretically, concatenation works fine for delta-encoded position
> data because positions start over at 0 with each document.  They're
> wholly self-contained.
>
> As a counter-example, consider that we can concatenate serialized
> documents so long as the associations between field names and
> numbers stay constant, but we must decode and re-encode when those
> associations change.
>
> The merger cannot know in advance whether concatenation is
> appropriate because the codec is opaque to it.  I think the default
> should be to merge fully expanded objects, with the option to
> concatenate as an optimization left up to the codec.  Otherwise, we
> end up restricting the kind of external references that can be made
> by a posting.

Sorry -- "concatenate" was an overloaded term -- I didn't mean we
would literally physically concatenate bytes (thus requiring a posting
format that allows for that).  I meant that, logically, this is what
merging does.  Ie we preserve all the per-doc stats for all
non-deleted docs when we merge.
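Ie, something like this logical merge of a per-doc stat (a toy sketch of the idea; in a real merge this would go through the codec rather than raw arrays): each segment's values are appended in order, with deleted docs' entries dropped.

```java
public class StatsMerge {
    /**
     * "Logical concatenation": append each segment's per-doc field
     * lengths in segment order, skipping deleted docs.
     * deleted[s][d] == true means doc d of segment s is deleted.
     */
    public static int[] mergeFieldLengths(int[][] segLengths, boolean[][] deleted) {
        int live = 0;
        for (int s = 0; s < segLengths.length; s++)
            for (int d = 0; d < segLengths[s].length; d++)
                if (!deleted[s][d]) live++;

        int[] merged = new int[live];
        int out = 0;
        for (int s = 0; s < segLengths.length; s++)
            for (int d = 0; d < segLengths[s].length; d++)
                if (!deleted[s][d]) merged[out++] = segLengths[s][d];
        return merged;
    }

    public static void main(String[] args) {
        int[][] lengths = {{3, 7, 5}, {2, 9}};
        boolean[][] del = {{false, true, false}, {false, false}};
        // Doc 1 of segment 0 is deleted; its stat is dropped, the rest
        // keep their relative order:
        System.out.println(
            java.util.Arrays.toString(mergeFieldLengths(lengths, del)));
        // [3, 5, 2, 9]
    }
}
```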
For the actual impl, I agree the default is to fully decode &
re-encode.  This is how flex handles merging today (but the codec can
override it, at each level of fields/terms/postings, if it wants).

Lucene's default postings format could nicely support raw byte
concatenation, since the docIDs are delta coded.  You'd just have to
rewrite the one posting at each segment boundary.  But I don't think
we've explored that one yet... it could be a good merging speedup,
though only for the no-deletes (or maybe few-deletes?) case.

>> * Change IR so on open it generates norms dynamically, ie by
>>   walking the stats, computing avgs (eg avg field length in
>>   tokens), and computing the final per-field boost, casting to a
>>   1-byte quantized float.
>
> Right, I think this data should be cached by a per-field parent
> reader object during an initialization phase.

Yes.

>> We may want to store aggregates in eg SegmentInfo to save the
>> extra pass on IR open...
>
> Hmm, that sounds familiar...
>
>   https://svn.apache.org/repos/asf/lucene/lucy/trunk/core/Lucy/Index/Segment.bp
>
>   Each Segment object keeps track of information about an index
>   segment: its fields, document count, and so on.  The Segment
>   object itself writes one file, segmeta.json; besides storing info
>   needed by Segment itself, the "segmeta" file serves as a central
>   repository for metadata generated by other index components --
>   relieving them of the burden of storing metadata themselves.

Right, this is Lucy's correlate of Lucene's SegmentInfos/segments_N
file.

> As far as aggregates go, I think you want to be careful to avoid
> storing any kind of data that scales with segment size within a
> SegmentInfo.

Right.  By aggregates I mean eg the total sum of all field lengths (a
single long) for the segment.

>> * Change Similarity to allow a field-specific Similarity (I think
>>   we have an issue open for this already).  I think, also,
>>   lengthNorm (which is no longer invoked during indexing) would no
>>   longer be used.
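A minimal sketch of what per-field Similarity state could look like (hypothetical names; the real API is exactly what's under discussion here) -- each field gets its own Similarity instance, initialized once, holding per-field state such as a norms array:

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldSimilarity {
    /** Stand-in for a Similarity carrying per-field state (eg a norms array). */
    public static class Similarity {
        final byte[] norms;  // per-doc norms, computed once at init for this field
        public Similarity(byte[] norms) { this.norms = norms; }
        public float norm(int doc) { return norms[doc] / 255f; }
    }

    /** Provider mapping field name -> Similarity, with a fallback default. */
    public static class SimilarityProvider {
        private final Map<String, Similarity> perField = new HashMap<>();
        private final Similarity defaultSim;
        public SimilarityProvider(Similarity def) { this.defaultSim = def; }
        public void set(String field, Similarity sim) { perField.put(field, sim); }
        public Similarity get(String field) {
            return perField.getOrDefault(field, defaultSim);
        }
    }

    public static void main(String[] args) {
        Similarity def = new Similarity(new byte[]{(byte) 127});
        SimilarityProvider provider = new SimilarityProvider(def);
        provider.set("title", new Similarity(new byte[]{(byte) 51}));
        // "title" uses its own sim and norms; unconfigured fields fall back:
        System.out.println(provider.get("body") == def);  // true
    }
}
```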
> Well, as you might suspect, I consider that one a gimme.  KinoSearch
> supports per-field Similarity now.

Great!  This is pretty much required, since each field now has
substantial init cost & state associated with it.  The norms array
will be stored in this per-field sim instance.

> The insane loose typing of fields in Lucene is going to make it a
> little tricky to implement, though.  I think you just have to
> exclude fields assigned to specific similarity implementations from
> your merge-anything-to-the-lowest-common-denominator policy and
> throw exceptions when there are conflicts rather than attempt to
> resolve them.

Our disposition on conflict (throw an exception vs silently coerce)
should just match what we do today, which is to always silently
coerce.  I think I'd coerce the way norms do -- "once enabled, always
enabled".  Throwing an exception is bad because we can't detect the
conflict until it's too late, at which point no merges of different
segments can run.  If you mark a field as 'no stats', index docs, then
later mark the field as 'stats', then on merge the stats will be set
to default values.

>> I think we'd make the class that computes norms from these per-doc
>> stats on IR open pluggable.
>
> Similarity is where we decode norms right now.  In my opinion, it
> should be the Similarity object from which we specify per-field
> posting formats.

I agree.

> See my reply to Robert in the BM25 thread:
>
>   http://markmail.org/message/77rmrfmpatxd3p2e
>
> That way, custom scoring implementations can guarantee that they
> always have the posting information they need available to make
> their similarity judgments.  Similarity also becomes a more
> generalized notion, with the TF/IDF-specific functionality moving
> into a subclass.
>
> Similarity implementation and posting format are so closely related
> that in my opinion, it makes sense to tie them.
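For concreteness, here is the BM25 tf/length factor driving all of this (the standard BM25 formula as toy code, not any proposed API): it needs the per-doc field length plus the index-wide average field length, which is exactly the stat being debated above.

```java
public class BM25Norm {
    /**
     * BM25's tf component with length normalization:
     *   tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
     * Note the dependence on avgDocLen, an index-wide aggregate that
     * cannot be finalized while writing any single segment.
     */
    public static float tfScore(float tf, float docLen, float avgDocLen,
                                float k1, float b) {
        float norm = k1 * (1 - b + b * docLen / avgDocLen);
        return tf * (k1 + 1) / (tf + norm);
    }

    public static void main(String[] args) {
        float k1 = 1.2f, b = 0.75f;
        // Same tf; one doc at the average length, one at twice the average:
        float avgLenScore = tfScore(3, 10, 10, k1, b);
        float longScore = tfScore(3, 20, 10, k1, b);
        // The longer doc is penalized for the same term frequency:
        System.out.println(avgLenScore > longScore);  // true
    }
}
```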
This confuses me -- shouldn't what is stored in these stats (each
field's token length, each field's avg tf, whatever else a codec wants
to add over time...) be decoupled from the low-level format used to
store it?  A BM25 similarity impl plugged into Lucene shouldn't know
or care how the codec wrote the stats to disk.  It only cares that it
can ask for the full enum (field token length for all docs).

>> And, someday we could make what stats are gathered/stored during
>> indexing pluggable, but for starters I think we should simply
>> support the field length in tokens and avg tf per field.
>
> I would argue against making this your top priority, because I think
> adding half-baked features that require index-format changes is bad
> policy.

I'm also against doing this ("making what stats are stored, and how,
fully pluggable") for the dirt path.  I plan to start with "only" the
2 stats (per-doc field token length and per-doc avg tf).  I'm going to
pick an index format that encodes these stats.  It won't be pluggable
on the first go.  Start simple and then iterate...

> If you're looking for small steps, my suggestion would be to focus
> on per-field Similarity support.

Well, that alone isn't sufficient -- the index needs to record/provide
the raw stats, and doc boosting (the norms array) needs to be done
using these stats.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org