lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Commented: (LUCENE-1879) Parallel incremental indexing
Date Wed, 04 Nov 2009 22:13:32 GMT


Michael Busch commented on LUCENE-1879:

I realize the current implementation that's attached here is quite
complicated, because it works on top of Lucene's APIs.

However, I really like its flexibility. Right now you can easily
rewrite certain parallel indexes without touching the others. I use it
in quite different ways. E.g. you can easily load one parallel index
into a RAMDirectory or onto an SSD and leave the other ones on
conventional disk.

LUCENE-2025 only optimizes one use case of parallel indexing: the one
where you want to (re)write a parallel index containing *only* posting
lists. This will especially improve scenarios like the one Yonik
pointed out a while ago on java-dev, where you want to update only a
few documents rather than, e.g., a certain field for all documents.

In other use cases it is certainly desirable to have a parallel index
that contains a store. It really depends on what data you want to
update individually.

I envision the version of parallel indexing that goes into Lucene's
core quite differently from the current patch here. That's why I'd like
to refactor the IndexWriter (LUCENE-2026) into a SegmentWriter and,
let's call it, an IndexManager (the component that controls flushing,
merging, etc.). You can then have a ParallelSegmentWriter, which
partitions the data into parallel segments, and the IndexManager can
behave the same way as before.
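
To make the proposed split concrete, here is a minimal sketch in plain
Java. It is not Lucene code: every type, method, and naming scheme below
(including the field-to-column map and the "_N.column" segment names) is
a hypothetical stand-in, assuming only what the paragraph above says:
a SegmentWriter writes segments, a ParallelSegmentWriter partitions data
into parallel segments, and an IndexManager decides when to flush.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration only; none of them exist in Lucene.
interface SegmentWriter {
    void addDocument(Map<String, String> doc);
    /** Writes the buffered documents and returns the names of the segments written. */
    List<String> flush(String segmentName);
}

/** Buffers documents and writes exactly one segment per flush. */
class SimpleSegmentWriter implements SegmentWriter {
    private int bufferedDocs = 0;
    public void addDocument(Map<String, String> doc) { bufferedDocs++; }
    public List<String> flush(String segmentName) {
        bufferedDocs = 0;
        return Collections.singletonList(segmentName);
    }
}

/** Splits each document by field into columns; one parallel segment per column on flush. */
class ParallelSegmentWriter implements SegmentWriter {
    private final Map<String, String> fieldToColumn;
    private final Map<String, SegmentWriter> columns = new LinkedHashMap<>();

    ParallelSegmentWriter(Map<String, String> fieldToColumn) {
        this.fieldToColumn = fieldToColumn;
        for (String column : fieldToColumn.values()) {
            columns.putIfAbsent(column, new SimpleSegmentWriter());
        }
    }

    public void addDocument(Map<String, String> doc) {
        for (Map.Entry<String, SegmentWriter> column : columns.entrySet()) {
            Map<String, String> slice = new HashMap<>();
            for (Map.Entry<String, String> field : doc.entrySet()) {
                if (column.getKey().equals(fieldToColumn.get(field.getKey()))) {
                    slice.put(field.getKey(), field.getValue());
                }
            }
            // Every column receives a (possibly empty) slice so docIDs stay aligned.
            column.getValue().addDocument(slice);
        }
    }

    public List<String> flush(String segmentName) {
        List<String> written = new ArrayList<>();
        for (Map.Entry<String, SegmentWriter> column : columns.entrySet()) {
            written.addAll(column.getValue().flush(segmentName + "." + column.getKey()));
        }
        return written;
    }
}

/** Controls flushing; it does not know whether the writer is parallel or not. */
class IndexManager {
    private final SegmentWriter writer;
    private final int maxBufferedDocs;
    private final List<String> segments = new ArrayList<>();
    private int buffered = 0;
    private int segmentCounter = 0;

    IndexManager(SegmentWriter writer, int maxBufferedDocs) {
        this.writer = writer;
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument(Map<String, String> doc) {
        writer.addDocument(doc);
        if (++buffered >= maxBufferedDocs) {
            flush();
        }
    }

    void flush() {
        if (buffered == 0) {
            return;
        }
        segments.addAll(writer.flush("_" + segmentCounter++));
        buffered = 0;
    }

    List<String> segments() { return segments; }

    public static void main(String[] args) {
        Map<String, String> fieldToColumn = new LinkedHashMap<>();
        fieldToColumn.put("title", "main");
        fieldToColumn.put("tags", "tags");
        IndexManager manager = new IndexManager(new ParallelSegmentWriter(fieldToColumn), 2);
        for (int i = 0; i < 3; i++) {
            Map<String, String> doc = new HashMap<>();
            doc.put("title", "doc " + i);
            doc.put("tags", "x");
            manager.addDocument(doc);
        }
        manager.flush();
        System.out.println(manager.segments());
    }
}
```

The point of the sketch is that IndexManager only talks to the
SegmentWriter interface, so swapping in a parallel writer changes what a
flush produces without changing when flushes happen.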

You can keep thinking about the whole index as a collection of segments;
it will just be a matrix of segments now instead of a one-dimensional
list.

E.g. the norms could in the future be a parallel segment with a single
column-stride field that you can update by writing a new generation of
the parallel segment.
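
One way to picture the matrix and the "new generation" idea is the toy
model below. It is only a sketch under assumptions: the class name, the
column names ("postings", "norms"), and the ".genN" file naming are all
invented for illustration and do not reflect Lucene's file formats.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: rows are segments (docID ranges), columns are parallel indexes.
class SegmentMatrix {
    // segment name -> (column name -> current generation of that cell)
    private final Map<String, Map<String, Integer>> cells = new LinkedHashMap<>();

    /** Adds a row whose cells all start at generation 0. */
    void addRow(String segment, String... columns) {
        Map<String, Integer> row = new LinkedHashMap<>();
        for (String column : columns) {
            row.put(column, 0);
        }
        cells.put(segment, row);
    }

    /** Rewrites a single cell as a new generation; all other cells stay untouched. */
    int rewrite(String segment, String column) {
        Map<String, Integer> row = cells.get(segment);
        if (row == null || !row.containsKey(column)) {
            throw new IllegalArgumentException("no such cell: " + segment + "/" + column);
        }
        return row.merge(column, 1, Integer::sum);
    }

    /** Invented naming scheme that makes the current generation of a cell visible. */
    String fileName(String segment, String column) {
        return segment + "." + column + ".gen" + cells.get(segment).get(column);
    }

    public static void main(String[] args) {
        SegmentMatrix matrix = new SegmentMatrix();
        matrix.addRow("_0", "postings", "norms");
        matrix.addRow("_1", "postings", "norms");
        matrix.rewrite("_0", "norms"); // update norms for segment _0 only
        System.out.println(matrix.fileName("_0", "norms"));
        System.out.println(matrix.fileName("_0", "postings"));
    }
}
```

Updating the norms of one segment bumps the generation of exactly one
cell; the postings column of the same row and every cell in other rows
keep their generation.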

Things like two-dimensional merge policies will nicely fit into this
model.

Different SegmentWriter implementations will allow you to write single
segments in different ways, e.g. doc-at-a-time (the default one with
addDocument()) or term-at-a-time (like addIndexes*() works).
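
The difference between the two writing orders can be shown with a toy
inverted index. This is a sketch only: the class and method names are
hypothetical, postings are plain lists of docIDs, and the term-at-a-time
variant merely imitates the general shape of merging existing segments,
not what addIndexes*() actually does internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; postings map each term to an ascending list of docIDs.
class PostingsWriters {
    /** Doc-at-a-time: visit documents in docID order and append to each term's list. */
    static Map<String, List<Integer>> docAtATime(List<List<String>> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId)) {
                List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // Skip repeated terms within the same document.
                if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                    list.add(docId);
                }
            }
        }
        return postings;
    }

    /** Term-at-a-time: visit each term once and concatenate its postings across segments. */
    static Map<String, List<Integer>> termAtATime(
            List<Map<String, List<Integer>>> segments, List<Integer> docCounts) {
        TreeSet<String> terms = new TreeSet<>();
        for (Map<String, List<Integer>> segment : segments) {
            terms.addAll(segment.keySet());
        }
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (String term : terms) {
            List<Integer> out = new ArrayList<>();
            int base = 0; // docID offset of the current source segment
            for (int i = 0; i < segments.size(); i++) {
                List<Integer> source =
                        segments.get(i).getOrDefault(term, Collections.<Integer>emptyList());
                for (int docId : source) {
                    out.add(base + docId);
                }
                base += docCounts.get(i);
            }
            merged.put(term, out);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(java.util.Arrays.asList("a", "b"));
        docs.add(java.util.Arrays.asList("b", "c"));
        docs.add(java.util.Arrays.asList("a", "c"));
        List<Map<String, List<Integer>>> segments = new ArrayList<>();
        segments.add(docAtATime(docs.subList(0, 2)));
        segments.add(docAtATime(docs.subList(2, 3)));
        System.out.println(docAtATime(docs));
        System.out.println(termAtATime(segments, java.util.Arrays.asList(2, 1)));
    }
}
```

Both paths end in the same postings for the same documents; they differ
only in the order in which the segment is produced, which is why they
could plausibly sit behind a common writer interface.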

So I agree we can achieve updating posting lists the way you describe,
but it will be limited to posting lists then. If we allow (re)writing
*segments* in both dimensions I think we will create a more flexible
approach which is independent of what data structures we add to Lucene
- as long as they are not global to the index but per-segment as most
of Lucene's structures are today.

What do you think? Of course I don't want to over-complicate all this,
but if we can get LUCENE-2026 right, I think we can implement parallel
indexing in this segment-oriented way nicely.

> Parallel incremental indexing
> -----------------------------
>                 Key: LUCENE-1879
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>             Fix For: 3.1
>         Attachments: parallel_incremental_indexing.tar
> A new feature that allows building parallel indexes and keeping them in sync on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature:
> Discussion on java-dev:

