lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Baby steps towards making Lucene's scoring more flexible...
Date Tue, 02 Mar 2010 21:12:40 GMT
On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote:
> The problem is, these scoring models need the avg field length (in
> tokens) across the entire index, to compute the norms.
>
> Ie, you can't do that on writing a single segment.

I don't see why not.  We can just move everything you're doing on Searcher
open to index time, and calculate the stats and norms before writing the
segment out.

At search time, the only segment with valid norms would be the last one, so
we'd make sure the Searcher used those.

I think the fact that Lucy always writes one segment per indexing session --
as opposed to Lucene's one segment per document -- makes a difference here.
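To make the one-segment-per-session point concrete, here's a minimal sketch of accumulating per-field token counts while a segment is being written, so the average field length (and hence the norms) can be computed just before flush. The class and method names are illustrative, not actual Lucene or Lucy API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: accumulate per-field token counts during
 * inversion, so norms can be derived from the averages before the
 * segment is written out.
 */
class SegmentStats {
    private final Map<String, Long> tokenCounts = new HashMap<>();
    private long docCount = 0;

    // Called once per document per field during inversion.
    void addField(String field, int numTokens) {
        tokenCounts.merge(field, (long) numTokens, Long::sum);
    }

    void addDoc() {
        docCount++;
    }

    // Average field length across this segment, in tokens.
    double avgFieldLength(String field) {
        long total = tokenCounts.getOrDefault(field, 0L);
        return docCount == 0 ? 0.0 : (double) total / docCount;
    }
}
```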

Whether burning norms to disk at index time is the most efficient setup
depends on the ratio of commits to searcher-opens.

In a multi-node search cluster, pre-calculating norms at index-time wouldn't
work well without additional communication between nodes to gather corpus-wide
stats.  But I suspect the same trick that works for IDF in large corpora
would work for average field length: it tends to be stable over time,
so you can update it infrequently.

> So I think it must be done during searcher init.
> 
> The most we can do is store the aggregates (eg sum of all lengths in
> this segment) in the SegmentInfo -- this saves one pass on searcher
> init.

Logically...

   token_counts: {
       segment: {
           title: 4,
           content: 154,
       },
       all: {
           title: 98342,
           content: 2854213
       }
   }

(Would that suffice?  I don't recall the gory details of BM25.)
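For what it's worth, those aggregates would be enough for BM25's length normalization, which only needs each document's length and the corpus-wide average. A hedged sketch, where the document counts and the b constant are assumed values rather than anything from the index format above:

```java
/**
 * Sketch of how the aggregates could feed BM25: derive an average
 * field length from total token counts, then plug it into BM25's
 * length-normalization term, 1 - b + b * (docLen / avgLen).
 */
class Bm25Norm {
    // BM25 length-normalization factor for one document.
    static double lengthNorm(int docLen, double avgLen, double b) {
        return 1.0 - b + b * (docLen / avgLen);
    }

    // Average field length from index-wide aggregates.
    static double avgFieldLength(long totalTokens, long totalDocs) {
        return (double) totalTokens / totalDocs;
    }
}
```

A document exactly at the average length gets a factor of 1.0; longer documents are penalized, shorter ones boosted.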

As documents get deleted, the stats will gradually drift out of sync, just
like doc freq does.  However, that's mitigated if you regularly recycle
segments that exceed a threshold deletion percentage.
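To illustrate the drift with made-up numbers: the stored token totals still include deleted documents, so the derived average overstates reality until those segments are merged away.

```java
/**
 * Illustration of stat drift under deletion: stored aggregates keep
 * counting deleted docs, so the derived average field length is stale.
 * All numbers here are invented for the example.
 */
class StatDrift {
    static double avgLen(long tokens, long docs) {
        return (double) tokens / docs;
    }

    public static void main(String[] args) {
        long tokens = 1000, docs = 100;              // stored at write time
        long deletedDocs = 20, deletedTokens = 300;  // later deletions
        // Stale average, as the stored stats report it: 10.0
        System.out.println(avgLen(tokens, docs));
        // True average over live docs only: 8.75
        System.out.println(avgLen(tokens - deletedTokens, docs - deletedDocs));
    }
}
```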

> The norms array will be stored in this per-field sim instance.

Interesting, but that wasn't where I was thinking of putting them.  Similarity
objects need to be sent over the network, don't they?  At least they do in KS.
So I think we need a local per-field PostingsReader object to hold such cached
data.
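A sketch of the separation I have in mind, with hypothetical names: the Similarity stays a small, serializable policy object that can cross the wire, while a node-local per-field reader holds heavyweight cached data like the norms array.

```java
/**
 * Hypothetical per-field reader holding node-local cached data (the
 * norms array), kept out of the serializable Similarity object.
 * Names are illustrative, not a proposed Lucene/KS API.
 */
class FieldPostingsReader {
    private final String field;
    private final byte[] norms;  // cached locally, never sent over the wire

    FieldPostingsReader(String field, byte[] norms) {
        this.field = field;
        this.norms = norms;
    }

    byte norm(int docId) {
        return norms[docId];
    }

    String field() {
        return field;
    }
}
```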

> > The insane loose typing of fields in Lucene is going to make it a
> > little tricky to implement, though.  I think you just have to
> > exclude fields assigned to specific similarity implementations from
> > your merge-anything-to-the-lowest-common-denominator policy and
> > throw exceptions when there are conflicts rather than attempt to
> > resolve them.
> 
> Our disposition on conflict (throw exception vs silently coerce)
> should just match what we do today, which is to always silently
> coerce.

What do you do when you have to reconcile two posting codecs like this?

  * doc id, freq, position, part-of-speech identifier
  * doc id, boost

Do you silently drop all information except doc id?

> > Similarity is where we decode norms right now.  In my opinion, it
> > should be the Similarity object from which we specify per-field
> > posting formats.
> 
> I agree.

Great, I'm glad we're on the same page about that.

> > Similarity implementation and posting format are so closely related
> > that in my opinion, it makes sense to tie them.
> 
> This confuses me -- what is stored in these stats (each field's token
> length, each field's avg tf, whatever other a codec wants to add over
> time...) should be decoupled from the low level format used to store
> it?

I don't know about that.  I don't think it's necessary to decouple them.
There might be some minor code duplication, but similarity implementations
don't tend to be very large, so the DRY violation doesn't bother me.

What's going to be a little tricky is that you can't have just one
Similarity.makePostingDecoder() method.  Sometimes you'll want a match-only
decoder.  Sometimes you'll want positions.  Sometimes you'll want a
part-of-speech id.  It's more of an interface/roles situation than a subclass
situation.
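One way to sketch that interface/roles idea: the caller asks for the capability it needs, and the similarity either hands back a decoder playing that role or refuses. The enum values, interface names, and method are all illustrative, not a proposed API.

```java
/**
 * Sketch of role-based decoder construction: callers request a
 * capability rather than a concrete subclass.  All names here are
 * hypothetical.
 */
interface PostingDecoder {}                              // decodes doc ids at minimum
interface MatchOnlyDecoder extends PostingDecoder {}     // doc ids only
interface PositionsDecoder extends PostingDecoder {}     // doc ids + positions

enum DecoderFlavor { MATCH_ONLY, POSITIONS, PART_OF_SPEECH }

class ExampleSimilarity {
    PostingDecoder makePostingDecoder(DecoderFlavor flavor) {
        switch (flavor) {
            case MATCH_ONLY:
                return new MatchOnlyDecoder() {};
            case POSITIONS:
                return new PositionsDecoder() {};
            default:
                // This similarity's posting format lacks the requested role.
                throw new UnsupportedOperationException(
                    "postings don't support " + flavor);
        }
    }
}
```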

> > If you're looking for small steps, my suggestion would be to focus
> > on per-field Similarity support.
> 
> Well that alone isn't sufficient -- the index needs to record/provide
> the raw stats, and doc boosting (norms array) needs to be done using
> these stats.

Not sufficient, but it's probably a prerequisite.  Since it's a common feature
request anyway, I think it's a great place to start:

    http://lucene.markmail.org/message/ln2xkesici6aksbi
    http://lucene.markmail.org/thread/46vxibpubogtcy3g
    http://lucene.markmail.org/message/56bk6wrbwallyjvr
    https://issues.apache.org/jira/browse/LUCENE-2236

Marvin Humphrey


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

