lucene-dev mailing list archives

From "Karthick Sankarachary (JIRA)" <>
Subject [jira] Commented: (LUCENE-2425) An Anti-Merging Multi-Directory Indexing Framework
Date Tue, 04 May 2010 04:28:56 GMT


Karthick Sankarachary commented on LUCENE-2425:

Hi Michael,

To answer your first question, yes I do see some similarities between this issue and LUCENE-1879.
However, it appears that the latter serves only as a mirroring mechanism, whereas mirroring
is but one of the many applications of this feature (see LUCENE-2433). That said, the caching
split policy described in LUCENE-2433 does reuse the ParallelReader for reading the mirrors
(or splits) it maintains. The big differences that I see are as follows:

a) The split writer treats its (sub-)directories as black boxes, whereas the parallel writer
appears to regard them as white boxes.
b) The parallel writer appears to require consumers to be aware of whether a sub-directory
is a master or slave. The split writer, on the other hand, insulates the consumer from the
implementation details of the mirroring mechanism, by providing them with a single, logical
view into the mirrored index. 
c) The parallel writer proposes to use a two-phase mechanism for ensuring consistency of add/delete
operations on the index. The mirroring split policy does not (yet) take care to ensure that
the changes operate as a "unit of work", i.e., in an all-or-nothing fashion. For example,
when you commit the split writer, it currently attempts to commit each of the writers for
its sub-directories, but without addressing the failure scenario. To me, that is an oversight
that can easily be remedied. 
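
As an illustration of point (c), one way the missing "unit of work" behaviour could be sketched is a two-phase commit over the sub-directory writers. This is not taken from the LUCENE-2425 patch: SubWriter, UnitOfWorkCommitter and InMemoryWriter are hypothetical names, and the interface is only loosely modelled on the prepareCommit/commit/rollback methods of Lucene's IndexWriter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the writer of one split (assumption, not the patch's API).
interface SubWriter {
    void prepareCommit() throws Exception; // phase 1: do everything that might fail
    void commit() throws Exception;        // phase 2: make the prepared changes visible
    void rollback();                       // discard all uncommitted changes
}

// Hypothetical coordinator that commits every split or none of them.
final class UnitOfWorkCommitter {
    private final List<SubWriter> writers;

    UnitOfWorkCommitter(List<? extends SubWriter> writers) {
        this.writers = new ArrayList<>(writers);
    }

    /** Returns true if every split committed, false if the whole unit was rolled back. */
    boolean commitAll() {
        try {
            // Phase 1: ask every writer to prepare; nothing is visible yet.
            for (SubWriter w : writers) {
                w.prepareCommit();
            }
        } catch (Exception e) {
            // Any failure in phase 1 rolls back all writers, prepared or not.
            for (SubWriter w : writers) {
                w.rollback();
            }
            return false;
        }
        // Phase 2: all writers prepared successfully, so finish the commit.
        for (SubWriter w : writers) {
            try {
                w.commit();
            } catch (Exception e) {
                // This sketch cannot undo splits that already committed.
                throw new IllegalStateException("commit failed after prepare", e);
            }
        }
        return true;
    }
}

// In-memory fake used only to exercise the coordinator.
final class InMemoryWriter implements SubWriter {
    final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private final boolean failOnPrepare;

    InMemoryWriter(boolean failOnPrepare) {
        this.failOnPrepare = failOnPrepare;
    }

    void add(String doc) {
        pending.add(doc);
    }

    @Override
    public void prepareCommit() throws Exception {
        if (failOnPrepare) {
            throw new Exception("simulated prepare failure");
        }
    }

    @Override
    public void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    @Override
    public void rollback() {
        pending.clear();
    }
}

final class UnitOfWorkDemo {
    public static void main(String[] args) {
        InMemoryWriter healthy = new InMemoryWriter(false);
        InMemoryWriter broken = new InMemoryWriter(true);
        healthy.add("doc1");
        broken.add("doc1");
        boolean ok = new UnitOfWorkCommitter(Arrays.asList(healthy, broken)).commitAll();
        System.out.println("committed=" + ok + " healthy=" + healthy.committed);
    }
}
```

The sketch only covers failures during the prepare phase; a failure in the second phase would still leave the splits inconsistent, which is the usual limitation of a two-phase scheme without a recovery log.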

The best way to understand the capabilities of the split policies outlined above is to take
a look at their test cases. At the risk of sounding cliche, the proof is in the pudding. 

To answer your second question, a split does not necessarily need to be physically under the
directory abstraction. For example, in the case of LUCENE-2431, LUCENE-2432, LUCENE-2433,
LUCENE-2434 and LUCENE-2435, the splits are either RAM-based directories or URI-based directories,
both of which reside outside of the "master" directory (to use the terminology of LUCENE-1879).

Note that I don't go out of my way to ensure the consistency of the "postings files (merge
choices, flush, deletions files, segments files, turning off the stores, etc.)" across the
splits in the mirrored split writer. Instead, I assume that, as long as the mirrors are configured
and updated in the same way, the doc store files in each mirror will eventually be consistent.


> An Anti-Merging Multi-Directory Indexing Framework
> --------------------------------------------------
>                 Key: LUCENE-2425
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*, Index
>    Affects Versions: 3.0.1
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2425.patch
> By design, a Lucene index tends to merge documents that span multiple segments into fewer
> segments, in order to optimize its directory structure, which in turn leads to better search
> performance. In particular, it relies on a merge policy to specify the set of merge operations
> that should be performed when the index is optimized.
> Oftentimes, there is a need to do the exact opposite, which is to "split" the documents.
> This calls for a mechanism that facilitates sub-division of documents based on a certain (ideally,
> user-defined) algorithm. By way of example, one may wish to sub-divide (or partition) documents
> based on parameters such as time, space, real-timeliness, and so on. Herein, we describe an
> indexing framework that builds on the Lucene index writer and reader, to address use cases
> wherein documents need to diverge rather than converge.
> In brief, it associates zero or more sub-directories with the index's directory, which
> serve to complement it in some manner. The sub-directories (a.k.a. splits) are managed by
> a split policy, which is notified of all changes made to the index directory (a.k.a. super-directory),
> thus allowing it to modify its sub-directories as it sees fit. To make the index reader and
> writer "observable", we extend Lucene's reader and writer with the goal of providing hooks
> into every method that could potentially change the index. This allows for propagation of
> such changes to the split policy, which essentially acts as a listener on the index.
> We refer to each sub-directory (or split) and the super-directory as a sub-index of the
> containing index (a.k.a. the split index). Note that the sub-directory may not necessarily
> be co-located with the super-directory. Furthermore, the split policy in turn relies on one
> or more split rules to determine when to add or remove sub-directories. This allows for a
> clear separation of the event that triggers a split from the management of those splits.
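
The listener arrangement described in the issue can be sketched roughly as follows. This is a minimal illustration and not the API of the attached patch: Doc, SplitRule, SplitPolicy, ObservableWriter, PartitioningSplitPolicy and DailySplitRule are hypothetical names, the "directories" are plain in-memory lists, and the split rule is simplified to a function that names the split a document belongs to (here, one split per day of its timestamp).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical document: an id plus the timestamp used for partitioning.
final class Doc {
    final String id;
    final long timestampMillis;

    Doc(String id, long timestampMillis) {
        this.id = id;
        this.timestampMillis = timestampMillis;
    }
}

// Hypothetical split rule: decides which split a document falls into.
interface SplitRule {
    String splitFor(Doc doc);
}

// Hypothetical split policy: a listener on every change made through the writer.
interface SplitPolicy {
    void onAdd(Doc doc);
    void onDelete(String id);
}

// Example rule that partitions by time, one split per day.
final class DailySplitRule implements SplitRule {
    private static final long MILLIS_PER_DAY = 86_400_000L;

    @Override
    public String splitFor(Doc doc) {
        return "day-" + (doc.timestampMillis / MILLIS_PER_DAY);
    }
}

// Policy that manages the splits; a split is created the first time the rule names it.
final class PartitioningSplitPolicy implements SplitPolicy {
    final Map<String, List<Doc>> splits = new LinkedHashMap<>();
    private final SplitRule rule;

    PartitioningSplitPolicy(SplitRule rule) {
        this.rule = rule;
    }

    @Override
    public void onAdd(Doc doc) {
        splits.computeIfAbsent(rule.splitFor(doc), name -> new ArrayList<>()).add(doc);
    }

    @Override
    public void onDelete(String id) {
        for (List<Doc> split : splits.values()) {
            split.removeIf(d -> d.id.equals(id));
        }
    }
}

// "Observable" writer: updates the super-directory, then notifies the policy.
final class ObservableWriter {
    private final List<Doc> superDirectory = new ArrayList<>();
    private final SplitPolicy policy;

    ObservableWriter(SplitPolicy policy) {
        this.policy = policy;
    }

    void addDocument(Doc doc) {
        superDirectory.add(doc);
        policy.onAdd(doc);
    }

    void deleteDocument(String id) {
        superDirectory.removeIf(d -> d.id.equals(id));
        policy.onDelete(id);
    }

    int size() {
        return superDirectory.size();
    }
}

final class SplitPolicyDemo {
    public static void main(String[] args) {
        PartitioningSplitPolicy policy = new PartitioningSplitPolicy(new DailySplitRule());
        ObservableWriter writer = new ObservableWriter(policy);
        writer.addDocument(new Doc("a", 0L));
        writer.addDocument(new Doc("b", 86_400_000L));
        System.out.println(policy.splits.keySet());
    }
}
```

Keeping the rule separate from the policy mirrors the separation the description draws between the event that triggers a split and the management of the splits themselves.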
