lucene-dev mailing list archives

From "Nadav Har'El" <>
Subject Re: Concurrent merge
Date Wed, 21 Feb 2007 15:56:48 GMT
On Tue, Feb 20, 2007, Ning Li wrote about "Concurrent merge":
> I think it's possible for another version of IndexWriter to have
> a concurrent merge thread so that disk segments could be merged
> while documents are being added or deleted.
> This would be beneficial not only because it will improve indexing
> performance when there are enough system resources, but more
> importantly, disk segment merges will no longer block document
> additions or deletions.

Hi Ning,

Reading what you wrote here, I understand that the most important benefit
(as you see it) of this idea is to make the document addition time more
predictable: the current situation where addDocument() usually takes a
second, but sometimes can take an hour, will be gone.

I am wondering, however, whether this problem will actually be solved by
the single-background-merge-thread solution you propose. I'll explain the
possible problem I see below.

> Two limits on the total number of disk segments are used to detect
> merge's lag and to slow down document additions/deletions: a soft
> limit and a hard limit. When the number of disk segments reaches
> the soft limit, a document addition/deletion will be slowed down
> for time T. As the number of disk segments continues to grow, the
> time for which an addition/deletion is slowed down will increase.
> When the number of disk segments reaches the hard limit, document
> additions/deletions will be blocked until the number falls under
> the hard limit. The hard limit is almost never reached with proper
> slow down.
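The throttle in the quoted paragraph could be sketched roughly like this (the limits, delay values, and names are made-up numbers for illustration, not anything from Lucene):

```java
// Hypothetical sketch of the soft/hard limit slowdown Ning describes.
class MergeThrottle {
    static final int SOFT_LIMIT = 20;       // start slowing adds/deletes here
    static final int HARD_LIMIT = 40;       // block adds/deletes entirely here
    static final long BASE_DELAY_MS = 100;  // initial slowdown per operation

    // Delay to impose on an add/delete, growing as the backlog of
    // unmerged disk segments grows.
    static long delayFor(int segmentCount) {
        if (segmentCount < SOFT_LIMIT) return 0;
        if (segmentCount >= HARD_LIMIT) return Long.MAX_VALUE; // caller must wait
        // Grow the delay linearly between the soft and hard limits.
        return BASE_DELAY_MS * (segmentCount - SOFT_LIMIT + 1);
    }
}
```

With proper tuning the delay ramps up early enough that the hard limit (the `Long.MAX_VALUE` case) is almost never hit.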

How many merges can go on concurrently in your scheme? You mentioned "a merge
thread", which leads me to think that you mean that only one merge can go on
at once. You also said you prefer a single merge thread at the end of your
message.

However, I think that allowing only one merge to proceed at a time can prevent
you from achieving your intended improvement (of never having to wait an hour
to add a single document). Consider the following scenario:

Imagine that just after adding the millionth document, a huge merge starts
in the background, one that is going to take an hour.
At the same time, new documents continue to come in and be added to the index.
When you add more documents, you start getting small segments, which need to
be merged into larger segments, and so on. But if you don't allow multiple
merges to be active concurrently, the huge merge in progress will cause these
small merges to be postponed for later, and in just a few minutes you'll
accumulate many small segments, and very soon your "slowdown" mechanism will
start slowing down the additions, to the point that after a few minutes
additions will have to come to a grinding halt to avoid accumulating
thousands of new small segments.

This means that if you allow only one concurrent merge, your original
intention - of allowing additions to continue in (almost) full steam
while a huge merge is in progress - may not be realized.

A counter-argument may be that if you slow down the additions enough - say,
wait 10 seconds between additions - you won't accumulate too many new
segments during this hour, and at the same time never wait more than 10
seconds for an addition, so everything is perfect. But this view only holds
if additions really do arrive just once every 10 seconds. If a new document
is queued for addition, say, once every 2 seconds, then your deliberate
slowdown will cause new additions to be significantly delayed - by up to
almost the full hour!
Moreover, the 10 seconds you spend sleeping (doing nothing) could have been
used for something more productive on a multiple-CPU machine - such as
continuing to do additions and do more merging.
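To put a rough number on that delay, here is a back-of-the-envelope model (the one-hour merge, 10-second throttle, and 2-second arrival rate are my made-up figures, and I assume the backlog drains quickly once the merge ends):

```java
// Rough queueing model: during a long merge, one add is admitted every
// serviceSecs, but a new add arrives every arrivalSecs.
class SlowdownMath {
    static long worstWaitSeconds(long mergeSecs, long serviceSecs, long arrivalSecs) {
        long served = mergeSecs / serviceSecs;  // adds admitted while the merge runs
        // The add admitted exactly as the merge ends waits the longest:
        // it arrived at position*arrivalSecs and was served at
        // position*serviceSecs; later arrivals drain quickly afterwards.
        return served * serviceSecs - served * arrivalSecs;
    }
}
```

With these numbers, `worstWaitSeconds(3600, 10, 2)` gives 2880 seconds: a document that arrived 12 minutes into the merge waits about 48 minutes to be added.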

One possibility of solving this problem is to allow multiple background
merge threads, not just one.
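If one went that route, the single merge thread could simply become a small pool (a hypothetical sketch, not Lucene code - the class and method names are invented):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Several merges at once: an hour-long huge merge ties up one worker
// while the other keeps folding small segments together.
class MergePool {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    void submitMerge(Runnable merge) {
        pool.execute(merge);
    }

    void shutdown() {
        pool.shutdown();
    }
}
```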

Another possible solution I can think of is to use the background thread for
merging when it's free, but to do merges in the foreground (blocking the adds)
when it is busy. This means slowing down the adds a bit, but not with sleep()
calls, which don't utilize the available CPU, but with real work, so adds
will continue to proceed at the fastest possible speed (depending on the
machine). Presumably, the really huge merges (which I guess are the ones
that bother you) will happen in the background thread.
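A minimal sketch of that hybrid scheme (the flag and method names are invented for illustration; this is not Lucene's API):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Merge in the background when the merge thread is free, in the
// foreground (on the adding thread) when it is busy.
class HybridMerger {
    private final AtomicBoolean backgroundBusy = new AtomicBoolean(false);

    // Returns true if the merge was handed to a background thread,
    // false if it ran in the foreground on the caller's thread.
    boolean maybeMerge(Runnable merge) {
        if (backgroundBusy.compareAndSet(false, true)) {
            new Thread(() -> {
                try { merge.run(); }
                finally { backgroundBusy.set(false); }
            }).start();
            return true;
        }
        // Background thread is busy (e.g. on a huge merge): do this small
        // merge inline - the add is slowed by useful work, not by sleep().
        merge.run();
        return false;
    }
}
```

The presumption is that small merges dominate in count, so most adds pay only a small foreground cost while the huge merges stay in the background.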

> Other ideas are most welcome!

An idea I once had, but I have no idea how much it can help, is this:
It is conceivable (but I didn't actually test this) that a significant (?)
percentage of indexing time is IO time - disk write, read and seek.
It might be possible, while we're doing a merge of a few segments on one
disk drive, to write new segments, and do new merges, on a second disk
drive. This way, all the writing we do to the second disk drive doesn't
disturb the work being done on the first drive (fewer head seeks, and more
concurrent reading and writing on both disks).

But as I said, I never tested how much this idea can improve performance
on systems with multiple separate disks and multiple CPUs.
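The placement policy itself is trivial; something as simple as the following round-robin chooser would do for an experiment (the mount points are hypothetical, and wiring this into index writing is left entirely open):

```java
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

// Alternate where new segment files are created, so merge I/O on one
// disk does not contend with new writes on the other.
class AlternatingDisks {
    private final Path[] disks;
    private final AtomicLong next = new AtomicLong();

    AlternatingDisks(Path... disks) {
        this.disks = disks;
    }

    // Pick the disk for the next new segment, cycling through the mounts.
    Path dirForNextSegment() {
        int i = (int) (next.getAndIncrement() % disks.length);
        return disks[i];
    }
}
```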

> size (buffered documents) is not too big. And multiple disk merge
> threads require significant system resources to add benefit.

See my comments above on why multiple concurrent merges might be necessary,
depending on what "benefit" you were aiming at.


Nadav Har'El                        |      Wednesday, Feb 21 2007, 3 Adar 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |It's fortunate I have bad luck - without
                                    |it I would have no luck at all!

