lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Baby steps towards making Lucene's scoring more flexible...
Date Thu, 04 Mar 2010 17:23:38 GMT
On Tue, Mar 2, 2010 at 4:12 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote:
>> The problem is, these scoring models need the avg field length (in
>> tokens) across the entire index, to compute the norms.
>>
>> Ie, you can't do that on writing a single segment.
>
> I don't see why not.  We can just move everything you're doing on
> Searcher open to index time, and calculate the stats and norms
> before writing the segment out.
>
> At search time, the only segment with valid norms would be the last
> one, so we'd make sure the Searcher used those.

I see -- write norms for all segments (the full index) on each commit?
OK.

And in fact if we left it at searcher init time, you'd still
[technically] have to recompute the norms arrays across all segments
whenever one even tiny segment was added, since [technically] the
average has changed.  But I agree, once the index is large enough,
presumably the average won't change much, so...

Even in the NRT case we'd have to compute norms across the entire
index with only a small segment added.

> I think the fact that Lucy always writes one segment per indexing session --
> as opposed to Lucene's one segment per document -- makes a difference here.

Lucene isn't one segment per doc anymore -- it's one segment
per-when-RAM-buffer-filled-up.  Not sure it really makes a difference
though, since we [technically] need norms regen'd for the entire
index.

> Whether burning norms to disk at index time is the most efficient
> setup depends on the ratio of commits to searcher-opens.

Yes, and NRT opens.

> In a multi-node search cluster, pre-calculating norms at index-time
> wouldn't work well without additional communication between nodes to
> gather corpus-wide stats.  But I suspect the same trick that works
> for IDF in large corpuses would work for average field length: it
> will tend to be stable over time, so you can update it
> infrequently.

Right, I imagine we'd need to use this trick within a single index,
too.  Recomputing norms for the entire index when only a small new
segment was added to the new NRT reader will probably be too costly.

Though one alternative (if you don't mind burning RAM) is to skip
casting to norms, ie store the actual field length, and do the
divide-by-avg during scoring (though that's a biggish hit to search
perf).
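
A rough Java sketch of that alternative, using BM25's length
normalization as the example model.  The names here (RawLengthScorer,
fieldLengths, avgFieldLength) are made up for illustration and are not
Lucene APIs; k1 and b are the usual BM25 parameters.  It only shows
the trade-off: the average becomes a single scalar that can be
refreshed when a small segment arrives, at the cost of an int per doc
in RAM and a divide per hit.

```java
// Hypothetical sketch; none of these names are Lucene APIs.
final class RawLengthScorer {
    private final int[] fieldLengths;  // tokens in the field, indexed by docID
    private final double k1;
    private final double b;
    private double avgFieldLength;     // index-wide average, a single scalar

    RawLengthScorer(int[] fieldLengths, double avgFieldLength, double k1, double b) {
        this.fieldLengths = fieldLengths;
        this.avgFieldLength = avgFieldLength;
        this.k1 = k1;
        this.b = b;
    }

    // Cheap to call when a new segment shows up; no per-doc array is rewritten.
    void setAvgFieldLength(double avgFieldLength) {
        this.avgFieldLength = avgFieldLength;
    }

    // BM25 tf component; the divide-by-avg happens here, once per hit.
    double tfComponent(int docID, int freq) {
        double lengthNorm = 1.0 - b + b * (fieldLengths[docID] / avgFieldLength);
        return freq * (k1 + 1.0) / (freq + k1 * lengthNorm);
    }
}
```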

>> So I think it must be done during searcher init.
>>
>> The most we can do is store the aggregates (eg sum of all lengths in
>> this segment) in the SegmentInfo -- this saves one pass on searcher
>> init.
>
> Logically...
>
>   token_counts: {
>       segment: {
>           title: 4,
>           content: 154,
>       },
>       all: {
>           title: 98342,
>           content: 2854213
>       }
>   }
>
> (Would that suffice?  I don't recall the gory details of BM25.)

I think so, though why store "all" per segment?  The reader can regen
it on open?  (The JSON above comes from a single segment, right?)

lnu.ltc would need sum(avg(tf)) as well.
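
As a sketch of the "reader can regen on open" idea, assuming each
segment stores only its own aggregates (a sum of field lengths and a
doc count for the field).  SegmentFieldStats is a hypothetical class,
not something in Lucene, and the doc-count field is an assumption
beyond the token counts shown in the quoted example.

```java
import java.util.List;

// Hypothetical sketch; SegmentFieldStats is not an existing Lucene class.
final class SegmentFieldStats {
    final long sumFieldLengths;  // sum of this field's token counts in the segment
    final int docCount;          // docs in the segment that have this field

    SegmentFieldStats(long sumFieldLengths, int docCount) {
        this.sumFieldLengths = sumFieldLengths;
        this.docCount = docCount;
    }

    // Combine the per-segment aggregates into the index-wide average.
    // This is one pass over the segments, not over the documents.
    static double averageFieldLength(List<SegmentFieldStats> segments) {
        long tokens = 0;
        long docs = 0;
        for (SegmentFieldStats s : segments) {
            tokens += s.sumFieldLengths;
            docs += s.docCount;
        }
        return docs == 0 ? 0.0 : (double) tokens / docs;
    }
}
```

With only per-segment numbers stored, adding or dropping a segment
never invalidates what the other segments wrote.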

> As documents get deleted, the stats will gradually drift out of
> sync, just like doc freq does.  However, that's mitigated if you
> recycle segments that exceed a threshold deletion percentage on a
> regular basis.

Right.

>> The norms array will be stored in this per-field sim instance.
>
> Interesting, but that wasn't where I was thinking of putting them.
> Similarity objects need to be sent over the network, don't they?  At
> least they do in KS.  So I think we need a local per-field
> PostingsReader object to hold such cached data.

OK, maybe not stored on them, but accessible to them.  Maybe cached in
the SegmentReader.

Though we need every norm(docID) lookup to be fast.  Maybe we ask the
per-field Similarity to give us a scorer, that holds the right byte[]?
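
Something like the following, as a sketch only.  FieldScorer,
FieldSimilarity and SqrtTfSimilarity are hypothetical names, not
existing Lucene APIs, and the byte-to-float table is a stand-in
decoder rather than Lucene's actual norm encoding.  The Similarity
itself holds no per-segment state; the scorer it hands out closes over
the byte[] supplied by whoever caches it.

```java
// Hypothetical sketch; these types are not existing Lucene APIs.
interface FieldScorer {
    float score(int docID, int freq);
}

interface FieldSimilarity {
    // norms would come from wherever they are cached, e.g. the SegmentReader.
    FieldScorer scorer(byte[] norms, float termWeight);
}

final class SqrtTfSimilarity implements FieldSimilarity {
    // Stand-in byte -> float decoder; a real one must match the index-time encoder.
    private static final float[] NORM_TABLE = new float[256];
    static {
        for (int i = 0; i < 256; i++) {
            NORM_TABLE[i] = i / 255f;
        }
    }

    @Override
    public FieldScorer scorer(final byte[] norms, final float termWeight) {
        // One array read plus one table lookup per norm(docID).
        return (docID, freq) ->
            termWeight * (float) Math.sqrt(freq) * NORM_TABLE[norms[docID] & 0xFF];
    }
}
```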

>> > The insane loose typing of fields in Lucene is going to make it a
>> > little tricky to implement, though.  I think you just have to
>> > exclude fields assigned to specific similarity implementations from
>> > your merge-anything-to-the-lowest-common-denominator policy and
>> > throw exceptions when there are conflicts rather than attempt to
>> > resolve them.
>>
>> Our disposition on conflict (throw exception vs silently coerce)
>> should just match what we do today, which is to always silently
>> coerce.
>
> What do you do when you have to reconcile two posting codecs like this?
>
>  * doc id, freq, position, part-of-speech identifier
>  * doc id, boost
>
> Do you silently drop all information except doc id?

I don't know -- we haven't hit that yet ;)  The closest we have is
when <doc id> is merged with <doc id,freq,<position+>>, and in that
case we drop the freq,<position+>.

With flex this'll be up to the codec's merge methods.
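
The <doc id> vs <doc id,freq,<position+>> case above amounts to
keeping the lesser of two detail levels.  A minimal sketch, with
PostingDetail as a hypothetical enum (not a Lucene type):

```java
// Hypothetical sketch; PostingDetail is not an existing Lucene type.
enum PostingDetail {
    DOCS_ONLY,
    DOCS_AND_FREQS,
    DOCS_FREQS_AND_POSITIONS;

    // Silent coercion: the merged field keeps only what both inputs have.
    // Enum.compareTo orders constants by declaration order.
    static PostingDetail merge(PostingDetail a, PostingDetail b) {
        return a.compareTo(b) <= 0 ? a : b;
    }
}
```

This only works because these three levels are strictly nested.  The
two formats in the quoted question are not nested that way, so a rule
like this has no obvious answer for them beyond doc id.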

>> > Similarity is where we decode norms right now.  In my opinion, it
>> > should be the Similarity object from which we specify per-field
>> > posting formats.
>>
>> I agree.
>
> Great, I'm glad we're on the same page about that.

Actually [sorry] I'm no longer so sure I agree!

In flex we have a separate Codec class that's responsible for
creating the necessary readers/writers.  It seems like Similarity is a
consumer of these stats, but need not know what format is used to
encode them on disk?

>> > Similarity implementation and posting format are so closely
>> > related that in my opinion, it makes sense to tie them.
>>
>> This confuses me -- what is stored in these stats (each field's
>> token length, each field's avg tf, whatever else a codec wants to
>> add over time...) should be decoupled from the low level format
>> used to store it?
>
> I don't know about that.  I don't think it's necessary to decouple
> them.  There might be some minor code duplication, but similarity
> implementations don't tend to be very large, so the DRY violation
> doesn't bother me.
>
> What's going to be a little tricky is that you can't have just one
> Similarity.makePostingDecoder() method.  Sometimes you'll want a
> match-only decoder.  Sometimes you'll want positions.  Sometimes
> you'll want part-of-speech id.  It's more of an interface/roles
> situation than a subclass situation.

A match-only decoder is handled in flex now by asking for the DocsEnum
and then, while iterating, using only .doc() (even if under the hood
the codec spent effort decoding freq and maybe other things).

If you want positions you get a DocsAndPositionsEnum.

>> > If you're looking for small steps, my suggestion would be to
>> > focus on per-field Similarity support.
>
>> Well that alone isn't sufficient -- the index needs to
>> record/provide the raw stats, and doc boosting (norms array) needs
>> to be done using these stats.
>
> Not sufficient, but it's probably a prerequisite.  Since it's a
> common feature request anyway, I think it's a great place to start:
>
>    http://lucene.markmail.org/message/ln2xkesici6aksbi
>    http://lucene.markmail.org/thread/46vxibpubogtcy3g
>    http://lucene.markmail.org/message/56bk6wrbwallyjvr
>    https://issues.apache.org/jira/browse/LUCENE-2236

Agreed, it's definitely a prereq.

Mike
