lucene-dev mailing list archives

From "Ning Li" <>
Subject Concurrent merge
Date Tue, 20 Feb 2007 19:22:53 GMT
I think it's possible for another version of IndexWriter to have
a concurrent merge thread so that disk segments could be merged
while documents are being added or deleted.

This would be beneficial not only because it would improve indexing
performance when there are enough system resources but, more
importantly, because disk segment merges would no longer block
document additions or deletions.

I'd like to get feedback on this idea. Once we agree on the best
design, I can submit a full patch.

I have an initial implementation based on an earlier version of
Lucene (but with deletes via IndexWriter). The basic idea is to
separate a merge process into three steps:
  1 select disk segments to merge
  2 merge selected segments into one segment
  3 apply document deletions committed during the merge if any
    and replace selected segments with the result segment
The merge process is carried out in the merge thread. Steps 1 and
3 are executed in the critical section, but step 2, in which most
of the time is spent, is not.
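
To make the locking concrete, here is a minimal sketch of the three
steps in Java. It is not the patch itself: every class and method name
below is hypothetical, a "segment" is reduced to a list of document
ids, and the selection policy (take every current segment) is only a
placeholder.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the three-step merge; none of these names are Lucene APIs. */
class ConcurrentMergeSketch {
    /** Stand-in for a disk segment: an immutable list of document ids. */
    static final class Segment {
        final List<Integer> docs;
        Segment(List<Integer> docs) {
            this.docs = Collections.unmodifiableList(new ArrayList<>(docs));
        }
    }

    /** What step 1 hands to step 2: the chosen segments and the deletions known so far. */
    static final class MergeJob {
        final List<Segment> selected;
        final Set<Integer> deletedAtStart;
        MergeJob(List<Segment> selected, Set<Integer> deletedAtStart) {
            this.selected = selected;
            this.deletedAtStart = deletedAtStart;
        }
    }

    private final Object lock = new Object();            // the critical section
    private final List<Segment> segments = new ArrayList<>();
    private final Set<Integer> deleted = new HashSet<>();
    private Set<Integer> deletedDuringMerge = null;      // non-null while a merge is in flight

    void addSegment(List<Integer> docs) {
        synchronized (lock) { segments.add(new Segment(docs)); }
    }

    void deleteDocument(int docId) {
        synchronized (lock) {
            deleted.add(docId);
            if (deletedDuringMerge != null) deletedDuringMerge.add(docId);
        }
    }

    /** Step 1 (critical section): select segments and start tracking new deletions. */
    MergeJob selectSegments() {
        synchronized (lock) {
            deletedDuringMerge = new HashSet<>();
            // Placeholder policy: merge everything currently on disk.
            return new MergeJob(new ArrayList<>(segments), new HashSet<>(deleted));
        }
    }

    /** Step 2 (no lock held): the expensive part; additions and deletions keep running. */
    static List<Integer> mergeSegments(MergeJob job) {
        List<Integer> merged = new ArrayList<>();
        for (Segment s : job.selected)
            for (Integer doc : s.docs)
                if (!job.deletedAtStart.contains(doc)) merged.add(doc);
        return merged;
    }

    /** Step 3 (critical section): apply deletions made during the merge, then swap. */
    void commitMerge(MergeJob job, List<Integer> merged) {
        synchronized (lock) {
            merged.removeAll(deletedDuringMerge);
            segments.removeAll(job.selected);
            segments.add(0, new Segment(merged));
            deletedDuringMerge = null;
        }
    }

    int segmentCount() {
        synchronized (lock) { return segments.size(); }
    }

    List<Integer> liveDocs() {
        synchronized (lock) {
            List<Integer> live = new ArrayList<>();
            for (Segment s : segments)
                for (Integer doc : s.docs)
                    if (!deleted.contains(doc)) live.add(doc);
            return live;
        }
    }
}
```

A merge thread would call selectSegments(), mergeSegments() and
commitMerge() in a loop. Only the first and last calls take the lock,
so a deletion or a newly flushed segment that arrives during step 2
is picked up in step 3 instead of waiting for the merge to finish.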

There are three main challenges in enabling concurrent merge:
  1 designing a robust merge policy
  2 detecting when merges lag behind document additions/deletions
  3 slowing down document additions/deletions (and amortizing
    the cost) when merges fall behind

Because new disk segments (flushed from RAM) can continue to be
produced while a disk merge is going on, it is difficult to maintain
the two invariants guaranteed by the current IndexWriter. Thus it
is important and challenging to detect when merges start to lag
behind and to slow down document additions/deletions properly.

Several merge strategies are possible. In the initial implementation,
I adopted one similar to the merge policy in current IndexWriter.
Two limits on the total number of disk segments are used to detect
merge's lag and to slow down document additions/deletions: a soft
limit and a hard limit. When the number of disk segments reaches
the soft limit, a document addition/deletion will be slowed down
for time T. As the number of disk segments continues to grow, the
time for which an addition/deletion is slowed down will increase.
When the number of disk segments reaches the hard limit, document
additions/deletions will be blocked until the number falls under
the hard limit. With a proper slowdown, the hard limit is almost
never reached.
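
The soft/hard limit scheme can be sketched roughly as follows. This is
an illustration under assumptions, not the code from my implementation:
the class and method names are hypothetical, and the linear growth of
the delay past the soft limit is just one possible choice (the scheme
only requires that the delay increases with the segment count).

```java
/** Hypothetical sketch of soft/hard limit throttling; not a Lucene API. */
class MergeThrottleSketch {
    private final int softLimit;
    private final int hardLimit;
    private final long baseDelayMillis;   // the time "T" applied at the soft limit
    private int segmentCount = 0;         // number of disk segments, guarded by 'this'

    MergeThrottleSketch(int softLimit, int hardLimit, long baseDelayMillis) {
        if (softLimit >= hardLimit)
            throw new IllegalArgumentException("softLimit must be below hardLimit");
        this.softLimit = softLimit;
        this.hardLimit = hardLimit;
        this.baseDelayMillis = baseDelayMillis;
    }

    /**
     * Delay for a given segment count: 0 below the soft limit, T at the
     * soft limit, then growing linearly (an assumption of this sketch).
     */
    long delayMillisFor(int count) {
        if (count < softLimit) return 0;
        return baseDelayMillis * (count - softLimit + 1);
    }

    /** Called when buffered documents are flushed into a new disk segment. */
    synchronized void segmentFlushed() {
        segmentCount++;
    }

    /** Called by the merge thread after merging inputSegments segments into one. */
    synchronized void mergeFinished(int inputSegments) {
        segmentCount -= inputSegments - 1;
        notifyAll();                      // wake any writer blocked at the hard limit
    }

    synchronized int segmentCount() {
        return segmentCount;
    }

    /** Called at the start of every document addition or deletion. */
    void beforeUpdate() throws InterruptedException {
        long delay;
        synchronized (this) {
            // Hard limit: block until the count falls under the limit.
            while (segmentCount >= hardLimit) wait();
            delay = delayMillisFor(segmentCount);
        }
        // Soft limit: sleep outside the lock so the merge thread is not held up.
        if (delay > 0) Thread.sleep(delay);
    }
}
```

With softLimit = 10, hardLimit = 20 and T = 5 ms, an update pays nothing
at 9 segments, 5 ms at 10 segments and 15 ms at 12 segments, and blocks
outright at 20 segments until a merge completes.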

Other ideas are most welcome!

I also experimented with a concurrent flush thread, which flushes
RAM segments into a disk segment, and with multiple disk merge
threads. The flush thread provides limited additional benefit when
the RAM size (buffered documents) is not too big, and multiple disk
merge threads require significant system resources before they add
any benefit.
