lucy-dev mailing list archives

From Marvin Humphrey
Subject [lucy-dev] Index modernizer
Date Wed, 10 Nov 2010 15:49:34 GMT

As the index format changes, we accumulate cruft in our codebase to support
old indexes and old segments.  At some point, we need to purge such cruft and
abandon support for old indexes.  But if you are a user, it's hard to know
whether your index has old segments in it, and whether you can upgrade safely
to a given version of the library.

In theory, you can launch an Indexer, call Optimize(), and force it to rewrite
your index as one large segment.  But that hasn't always worked reliably, in
either Lucene or KinoSearch, because modernizing is orthogonal to optimizing
for search speed.  Both libraries, at one time or another, have detected the
case of an index with a single segment with no deletions, at which point they
decide that the index is already optimized and bail out.

I think a strategy dedicated specifically to modernization of an index is
called for.  For Lucy, it can be achieved with an application combining a
BackgroundMerger and an IndexManager which implements a custom merge policy.
Instead of rewriting to one large segment, this modernizer app should launch a
BackgroundMerger once for each segment, rewriting them one at a time.  Once
all segments are brought up to date, the app exits.
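
To make the one-segment-at-a-time loop concrete, here is a rough sketch in
Python.  It does not use Lucy's API: Segment, pick_one(), rewrite() and
CURRENT_FORMAT are hypothetical stand-ins for a segment, the custom merge
policy, a single BackgroundMerger run, and the latest format number.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

CURRENT_FORMAT = 3  # hypothetical "latest" format number


@dataclass(frozen=True)
class Segment:
    name: str
    format_version: int


def pick_one(segments: List[Segment], done: Set[str]) -> Optional[Segment]:
    """Merge policy stand-in: select exactly one not-yet-rewritten segment."""
    for seg in segments:
        if seg.name not in done:
            return seg
    return None


def rewrite(seg: Segment, new_name: str) -> Segment:
    """Stand-in for one merger run: same data, current format, new name."""
    return Segment(name=new_name, format_version=CURRENT_FORMAT)


def modernize(segments: List[Segment]) -> List[Segment]:
    """Run one rewrite pass per segment, then stop."""
    segments = list(segments)  # leave the caller's list untouched
    done: Set[str] = set()
    counter = 0
    while True:
        target = pick_one(segments, done)
        if target is None:
            return segments  # every segment has been rewritten; exit
        counter += 1
        fresh = rewrite(target, f"{target.name}_m{counter}")
        segments[segments.index(target)] = fresh
        done.add(fresh.name)
```

The point of the sketch is that each pass touches exactly one segment, so
the segment count stays the same and nothing resembling "already optimized"
detection can short-circuit the work.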

If possible, the modernizer should not rewrite segments that already use the
most up-to-date format.  This will be possible so long as the user has not
subclassed Architecture to plug in custom index components.  Under the default
Architecture, the stack of writers is known and finite, and we can easily
determine whether a given segment uses the most modern format for each writer.
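
As an illustration of that check, here is a small Python sketch.  It assumes
each segment's metadata records a format number per component; the component
names and numbers in LATEST_FORMATS are made up for the example and are not
taken from Lucy.

```python
from typing import Mapping

# Hypothetical table: the newest format number for each default writer.
LATEST_FORMATS = {"lexicon": 3, "postings": 1, "documents": 1, "deletions": 1}


def is_modern(
    seg_meta: Mapping[str, Mapping[str, int]],
    latest: Mapping[str, int] = LATEST_FORMATS,
) -> bool:
    """True if every component recorded in the segment is at its latest format."""
    for component, meta in seg_meta.items():
        if component not in latest:
            # Unknown component: we cannot vouch for it, so treat as outdated.
            return False
        if meta.get("format") != latest[component]:
            return False
    return True
```

A segment for which is_modern() returns True can be skipped; anything else
gets rewritten.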

If, on the other hand, a user has subclassed Architecture, we have to punt and
rewrite all segments.  Even that may not be sufficient, depending on whether
custom components operate outside of the segment system -- but that's a
far-off theoretical case, and I don't think adding an abstract Modernize()
method to DataWriter which all components must implement is justified.
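
The resulting selection rule is tiny.  A Python sketch, with hypothetical
names throughout (select_for_rewrite, default_architecture, and the
is_modern callback are illustrative only, not Lucy API):

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")


def select_for_rewrite(
    segments: List[S],
    default_architecture: bool,
    is_modern: Callable[[S], bool],
) -> List[S]:
    """Choose which segments the modernizer must rewrite."""
    if not default_architecture:
        # Custom components: formats can't be verified, so rewrite everything.
        return list(segments)
    # Default stack: rewrite only the segments that fail the format check.
    return [seg for seg in segments if not is_modern(seg)]
```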

I'm torn as to where to implement this functionality.  Since it may be
necessary to load custom classes, e.g. FieldType or Schema subclasses, that
suggests a Cookbook/sample app which the user might modify.  On the other
hand, if we are going to require that users run this app in order to upgrade
-- and we will, sooner or later -- maybe there ought to be a core class,
Lucy::Index::Modernizer...  Probably best to start with Cookbook/sample code
which makes no public API promises, methinks...

Marvin Humphrey
