lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Baby steps towards making Lucene's scoring more flexible...
Date Mon, 15 Mar 2010 05:03:19 GMT
On Sat, Mar 13, 2010 at 06:41:26AM -0500, Michael McCandless wrote:

> I still don't think similarity should have any bearing during indexing.

Similarity has always, from day one, affected the contents of the index.  This
idea that it should be totally divorced from indexing is, in fact, a very
significant change that you are proposing for Lucene, and it will require
non-trivial changes to the file format. 

For starters, you're going to at least double the footprint of the norms.  For
fields with more than 127 tokens or 127 unique terms, the increase will be
greater... and if the user sets doc-boost and field-boost in a pattern that
defies RLE compression, the footprint will be greater still.

I happen to think that limited search-time settability of Similarity offers a
nice feature -- the ability to futz with different weighting models and length
normalization settings without reindexing -- and that it's worth exploring in
pursuit of this feature.

But by opting to forgo the lossy compression now performed by encodeNorm() at
index-time and store precursor statistics instead, we are going to take a hit
on index size even with lossless compression.
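
To make the trade-off concrete, here is a minimal sketch of a lossy one-byte
float encoding in the spirit of what encodeNorm() does.  The class name
NormSketch and the exact bit layout are assumptions for illustration, not a
copy of Lucene's code.

```java
public class NormSketch {
    // Assumed layout: codes are offsets from this point in the truncated float.
    private static final int ZERO_POINT = (63 - 15) << 3;

    /** Squeeze a float into one byte by keeping only its high-order bits. */
    public static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;                 // discard the 21 low-order bits
        if (small <= ZERO_POINT) {
            // Zero and negative values map to 0; tiny positives map to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1;                   // too large: clamp to the top code
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Expand a one-byte code back into a float; the discarded bits are gone. */
    public static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xff) << 21) + (ZERO_POINT << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] samples = {1.0f, 0.3f, 0.26f};
        for (float s : samples) {
            byte b = encode(s);
            System.out.println(s + " -> byte " + b + " -> " + decode(b));
        }
    }
}
```

In this sketch 0.3 and 0.26 land on the same byte and both decode to 0.25.
That collapse is what buys the one-byte footprint, and it is exactly the
information that storing precursor statistics would have to keep.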

Furthermore, delaying Similarity choice means that it becomes the user's
responsibility to ensure that index-time Codec choice is compatible with
search-time Similarity choice.  In contrast, setting Similarity at index-time
means that the core gets to pick the Codec and can ensure that all the
necessary data gets encoded, sparing the user from having to understand the
gory details of posting formats.

In summary, I think search-time setting of Similarity is a nice feature but a
poor requirement.  I'm not persuaded that this proposal to banish Similarity
from index-time is wise.

> But I don't like baking in search concepts at index time...

Then you ought to use a traditional RDBMS rather than an indexing engine, and
make sure you don't put indexes on any of the fields in your tables.  :)  

Or maybe an RDBMS has too many search concepts baked in, and a flat file would
be best.  :)

Seriously... optimizing on-disk data structures to accommodate anticipated
search query patterns and maximize speed and relevance... that's what
indexing's all about, ain't it?

And what class other than Similarity knows enough about the scoring algorithm
to perform these data reduction tasks?  If it's not going to be Similarity
itself, it has to be something that knows absolutely everything about the
Similarity implementation's scoring model.

> > Right.  However, now that I've thought about it, if a user indicates that a
> > field is "match-only" by supplying a MatchSimilarity, we know that we can
> > omit boost bytes.
> >
> > So we can re-conceive "MatchSimilarity" as being analogous to omitNorms.
> > Huzzah!
> >
> > One down, one to go.  :)
>
> Hmm except shouldn't you allow omitting boost bytes but keeping term
> freqs?  Ie all docs are roughly the same length (say, a title field)
> and I never boost them?  How will you allow this?

I think that you've described an uncommon use case, and it's tempting to just
wave it off with the easy answer: you spec a Sim that writes such a format.

But here's where maybe Lucy can steal from the Lucene flex branch.  We can
give Similarity a makePostingCodec() factory method.  Then, we can publish
common PostingCodecs as public classes, allowing us to support different
formats with minimal effort.

  class MySim extends Similarity {
    public PostingCodec makePostingCodec() {
      // Start from a published codec; any format tweaks would be applied here.
      StandardPostingCodec codec = new StandardPostingCodec();
      return codec;
    }
  }

(FWIW, you could theoretically do something similar with Lucene: supply one
Sim at index time, but write precursors instead of boost bytes and allow a
different Sim to be used at search-time.)

This setup follows the easy-things-easy-hard-things-possible model, because
the user doesn't have to know posting formats intimately to start optimizing
away needless data, but experts like Earwin get the direct access they
seemingly can't live without.
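
A slightly fuller sketch of how that factory might cover the title-field case
raised above.  Every name here (PostingCodec's two flags, TitleSimilarity, the
default behavior of the base class) is hypothetical; none of it is an existing
Lucy or Lucene API.  MatchSimilarity is assumed to skip term freqs as well as
boost bytes.

```java
public class CodecFactorySketch {
    /** Hypothetical posting format: records which per-document data gets written. */
    public static class PostingCodec {
        public final boolean writesTermFreqs;
        public final boolean writesBoostBytes;

        public PostingCodec(boolean writesTermFreqs, boolean writesBoostBytes) {
            this.writesTermFreqs = writesTermFreqs;
            this.writesBoostBytes = writesBoostBytes;
        }
    }

    /** Hypothetical base class: by default, write everything a scorer might need. */
    public static class Similarity {
        public PostingCodec makePostingCodec() {
            return new PostingCodec(true, true);
        }
    }

    /** Match-only field: doc ids alone (assumed to drop freqs and boost bytes). */
    public static class MatchSimilarity extends Similarity {
        @Override
        public PostingCodec makePostingCodec() {
            return new PostingCodec(false, false);
        }
    }

    /** Short, never-boosted field such as a title: keep freqs, drop boost bytes. */
    public static class TitleSimilarity extends Similarity {
        @Override
        public PostingCodec makePostingCodec() {
            return new PostingCodec(true, false);
        }
    }

    public static void main(String[] args) {
        Similarity[] sims = {
            new Similarity(), new MatchSimilarity(), new TitleSimilarity()
        };
        for (Similarity sim : sims) {
            PostingCodec codec = sim.makePostingCodec();
            System.out.println(sim.getClass().getSimpleName()
                + ": freqs=" + codec.writesTermFreqs
                + ", boostBytes=" + codec.writesBoostBytes);
        }
    }
}
```

The user only picks a Similarity per field; the Similarity decides what gets
written, so the format can never be missing data the scoring model needs.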

> I agree it's not great to have to speak/think in low level indexing
> attr concepts... because it forces user to translate to what that
> means at search time.  But I still don't see a great alternative.  I
> don't like pushing the Sim choice all the way back into indexing.

You make it sound like the way things have been done since forever is some
radical experiment. :P

Similarity choice IS made at index time.  Nobody's pushing it back -- you're
proposing that it be pushed forward.

> > Under Lucy, you can't switch to a different weighting model at search time
> > because the boost bytes are baked into the index.  But you can still do
> > doc-id-only posting iteration against any posting format since doc-id-only is
> > the minimum requirement for a posting list.
> >
> > So your question is predicated on the assumption that you need a
> > doc-id-only Similarity to do doc-id-only postings iteration, but that's not
> > true -- you need a doc-id-only PostingDecoder, which may be spawned by any
> > Similarity.
> >
> > Does that make sense?
>
> It sounds like... if the user had used AllBellsAndWhistlesScoringSim
> while indexing, they will still be able to use MatchOnlySim while
> searching because under-the-hood MatchOnlySim knows how to pull a
> docID only postings iterator from that field.

You seem to be fixated on the notion of swapping in a MatchOnlySim object at
search time.  You can't do that in KS/Lucy, because you can't modify a Schema
at search-time, and the per-field Similarity assignments are part of the
Schema.  But *it doesn't matter* because you don't need a MatchOnlySim to
do doc-id-only postings iteration -- an AllBellsAndWhistlesScoringSim can
spawn a doc-id-only PostingDecoder just as easily as MatchOnlySim can.
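
A minimal sketch of that last point, assuming a toy in-memory posting layout
of (doc id, freq, boost byte) triples.  The DocIdIterator interface, the
docIdOnlyDecoder() method and the layout are all hypothetical stand-ins, not
KS/Lucy API; only the Sim's name comes from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

public class DecoderSketch {
    /** Hypothetical doc-id-only view over a posting list. */
    public interface DocIdIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
    }

    /** A fully featured Sim whose postings carry doc id, freq and boost byte. */
    public static class AllBellsAndWhistlesScoringSim {
        static final int STRIDE = 3;   // assumed layout: docId, freq, boostByte

        /** Spawn a decoder that reads doc ids and steps over everything else. */
        public DocIdIterator docIdOnlyDecoder(final int[] postings) {
            return new DocIdIterator() {
                private int pos = 0;

                @Override
                public int nextDoc() {
                    if (pos >= postings.length) {
                        return NO_MORE_DOCS;
                    }
                    int doc = postings[pos];
                    pos += STRIDE;     // skip this posting's freq and boost byte
                    return doc;
                }
            };
        }
    }

    /** Drain an iterator into a list of doc ids. */
    public static List<Integer> collect(DocIdIterator it) {
        List<Integer> ids = new ArrayList<>();
        for (int doc = it.nextDoc(); doc != DocIdIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            ids.add(doc);
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] postings = {3, 2, 124, 7, 1, 120, 12, 5, 116};
        AllBellsAndWhistlesScoringSim sim = new AllBellsAndWhistlesScoringSim();
        System.out.println(collect(sim.docIdOnlyDecoder(postings)));
    }
}
```

No second Similarity is involved: the Sim that wrote the rich format hands
out the cheap iterator itself.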

Marvin Humphrey
