incubator-lucy-dev mailing list archives

From "Nathan Kurz" <>
Subject Re: Posting codecs
Date Mon, 29 Sep 2008 19:09:16 GMT
On Sat, Sep 27, 2008 at 3:14 PM, Marvin Humphrey <> wrote:
>> The downside is that each Scorer remains tied to particular index
>> format.  Long-term I still think this is disastrous, but in the short
>> term it's not that bad.
> Can you please elaborate on what you see as the downsides?

I could be wrong about this, but I'll start here, as this question
relates closely to my motivations.

In the time I was trying to make use of KinoSearch, I found it very
difficult to experiment with new scoring systems and new index
formats.  Each time I wanted to try something new (positional scoring
instead of TF/IDF, or reading posting lists from SQLite) it felt like
I had to reinvent the whole wheel.  Despite the layers of abstraction,
there are still a lot of cross-dependencies.

Currently, each posting class is tied to the internal binary format of
the index in use.   And the low level scorers (like PhraseScorer)
presume a binary layout of the Posting.   Creating a new index format
involves either creating a whole bunch of classes, or
understanding the interactions of the existing classes well enough to
maintain full compatibility.  Despite considerable time spent, I still
don't feel like I understand these interactions.

Worse, my own uses of KinoSearch are likely to include custom scorers
interacting with custom indexes.  It's highly unlikely that anyone
else is going to have exactly the same needs.  But it seems reasonably
likely that others would be interested in just the scoring approach,
or just the index format.  I think development might go much faster if
these two could be decoupled.

The goal would be to make it possible to write a new index format as a
single class and to have the existing scorers just keep working.
Conversely, I want to have my (theoretical) custom scorers keep
working even though the underlying index format changes.  I want it to
be possible to use others' components piece by piece without having to
replace the whole Scorer/Posting/InStream/etc. complex, and to make it
possible for others to use and test my components without having to
use my whole system.
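
To make the goal concrete, here is a rough sketch of the kind of
decoupling I have in mind.  It is written in Python for brevity and is
not KinoSearch code: PostingList, InMemoryPostings, DeltaVarintPostings,
and both scorer functions are hypothetical names invented for this
illustration.  The scorers only see (doc_id, positions) pairs, so each
index format is a single class and the scorers never touch its bytes.

```python
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple

Posting = Tuple[int, List[int]]  # (doc_id, sorted term positions)


class PostingList(Protocol):
    """The only thing a scorer is allowed to know about an index format."""
    def postings(self) -> Iterator[Posting]: ...


class InMemoryPostings:
    """Format 1 (hypothetical): a plain dict of doc_id -> positions."""
    def __init__(self, data: Dict[int, List[int]]):
        self._data = dict(data)

    def postings(self) -> Iterator[Posting]:
        for doc_id in sorted(self._data):
            yield doc_id, list(self._data[doc_id])


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, high bit = 'more'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next offset) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


class DeltaVarintPostings:
    """Format 2 (hypothetical): doc-id delta, position count, position deltas."""
    def __init__(self, buf: bytes):
        self._buf = buf

    @classmethod
    def from_postings(cls, postings: Iterable[Posting]) -> "DeltaVarintPostings":
        out, prev_doc = bytearray(), 0
        for doc_id, positions in postings:  # must arrive sorted by doc_id
            out += encode_varint(doc_id - prev_doc)
            prev_doc = doc_id
            out += encode_varint(len(positions))
            prev_pos = 0
            for p in positions:
                out += encode_varint(p - prev_pos)
                prev_pos = p
        return cls(bytes(out))

    def postings(self) -> Iterator[Posting]:
        pos, doc_id = 0, 0
        while pos < len(self._buf):
            delta, pos = decode_varint(self._buf, pos)
            doc_id += delta
            count, pos = decode_varint(self._buf, pos)
            positions, p = [], 0
            for _ in range(count):
                d, pos = decode_varint(self._buf, pos)
                p += d
                positions.append(p)
            yield doc_id, positions


def term_frequency_scorer(plist: PostingList) -> Dict[int, float]:
    """Score each doc by how often the term occurs in it."""
    return {doc: float(len(positions)) for doc, positions in plist.postings()}


def early_position_scorer(plist: PostingList) -> Dict[int, float]:
    """A toy positional score: earlier first occurrence scores higher."""
    return {doc: 1.0 / (1 + positions[0])
            for doc, positions in plist.postings() if positions}
```

Either scorer can be handed either format without modification, and a
third format would only need its own postings() method.  Whether an
interface this narrow can be made fast enough for real use is a
separate question that this sketch does not address.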

I don't think this is possible with the current approach, and I fear
this will hinder future development and developers.  This could be
just evidence of my own limitations, though.  Perhaps I just need
better examples of how to accomplish these things with the current
system.  Thus my suggestions for adding parallel support for P4Delta
compression, reading Lucene indexes directly, and non-TF/IDF
positional scoring.  I'm hoping that you'll either show the way to do
this effectively, or realize the need for architectural changes to
allow this.

Nathan Kurz
