From: Marvin Humphrey
To: lucy-dev@incubator.apache.org
Date: Wed, 10 Nov 2010 07:49:34 -0800
Message-ID: <20101110154934.GA17757@rectangular.com>
Subject: [lucy-dev] Index modernizer

Greets,

As the index format changes, we accumulate cruft in our codebase to support
old indexes and old segments.  At some point, we need to purge such cruft and
abandon support for old indexes.

But if you are a user, it's hard to know whether your index has old segments
in it, and whether you can upgrade safely to a given version of the library.

In theory, you can launch an Indexer, call Optimize(), and force it to rewrite
your index as one large segment.  But that hasn't always worked reliably, in
either Lucene or KinoSearch, because modernizing is orthogonal to optimizing
for search speed.  Both libraries, at one time or another, have detected the
case of an index with a single segment with no deletions, at which point they
decide that the index is already optimized and bail out.

I think a strategy dedicated specifically to modernization of an index is
called for.  For Lucy, it can be achieved with an application combining a
BackgroundMerger and an IndexManager which implements a custom merge policy.
Instead of rewriting to one large segment, this modernizer app should launch a
BackgroundMerger once for each segment, rewriting them one at a time.  Once
all segments are brought up to date, the app exits.

If possible, the modernizer should not rewrite segments that already use the
most up-to-date format.  This will be possible so long as the user has not
subclassed Architecture to plug in custom index components.  Under the default
Architecture, the stack of writers is known and finite, and we can easily
determine whether a given segment uses the most modern format for each
component.

If, on the other hand, a user has subclassed Architecture, we have to punt and
rewrite all segments.  Even that may not be sufficient, depending on whether
custom components operate outside of the segment system -- but that's a
far-off theoretical case, and I don't think adding an abstract Modernize()
method to DataWriter which all components must implement is justified.

I'm torn as to where to implement this functionality.  Since it may be
necessary to load custom classes, e.g. FieldType or Schema subclasses, that
suggests a Cookbook/sample app which the user might modify.
On the other hand, if we are going to require that users run this app in order
to upgrade -- and we will, sooner or later -- maybe there ought to be a core
class, Lucy::Index::Modernizer...

Probably best to start with Cookbook/sample code which makes no public API
promises, methinks...

Marvin Humphrey
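
For illustration, the segment-selection logic described above might be
sketched roughly as follows.  This is plain Python, not Lucy code: Segment,
CURRENT_FORMATS, the component names and version numbers, and the rewrite
callback are all hypothetical stand-ins, with each rewrite() call playing the
role of one BackgroundMerger run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: newest format version for each component in the default
# writer stack. Names and numbers are invented for illustration only.
CURRENT_FORMATS: Dict[str, int] = {"lexicon": 3, "postings": 2, "doc_storage": 1}


@dataclass
class Segment:
    name: str
    formats: Dict[str, int]  # component name -> format version used on disk


def is_modern(seg: Segment) -> bool:
    """True if every component of the segment uses the newest known format."""
    return seg.formats == CURRENT_FORMATS


def segments_to_rewrite(segments: List[Segment],
                        custom_architecture: bool) -> List[Segment]:
    """Pick the segments the modernizer has to rewrite."""
    if custom_architecture:
        # Unknown components may be present, so punt and rewrite everything.
        return list(segments)
    # Default stack: skip segments that are already up to date.
    return [seg for seg in segments if not is_modern(seg)]


def modernize(segments: List[Segment],
              custom_architecture: bool,
              rewrite: Callable[[Segment], None]) -> List[str]:
    """Rewrite stale segments one at a time; return the names rewritten."""
    rewritten = []
    for seg in segments_to_rewrite(segments, custom_architecture):
        rewrite(seg)  # stand-in for launching one merge for this segment
        rewritten.append(seg.name)
    return rewritten


def upgrade_in_place(seg: Segment) -> None:
    """Toy rewrite: mark the segment as using the newest formats."""
    seg.formats = dict(CURRENT_FORMATS)


if __name__ == "__main__":
    index = [
        Segment("seg_1", {"lexicon": 2, "postings": 2, "doc_storage": 1}),
        Segment("seg_2", dict(CURRENT_FORMATS)),
    ]
    print(modernize(index, custom_architecture=False, rewrite=upgrade_in_place))
```

Running modernize() a second time on the same list under the default stack
rewrites nothing, which is the point at which the app would exit.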