lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From robert engels <>
Subject Re: Concurrent merge
Date Wed, 21 Feb 2007 22:46:14 GMT
I agree, that is why I suggested using the queue. You will add new  
segments to be merged to a queue. The merge thread pulls segments  
form the queue when it is free to do so. The addDocument() creates a  
new segment and adds it to the queue. If the queue is full, the add  
to the queue will block until the queue has room.

As for using multiple disks, this is a great idea. We use a balanced  
merge sort in our application, and splitting the merge across disks  
and using multiple threads, we were able to double the IO throughput.  
It probably doesn't make much difference for sites with advanced  
striped disk controllers/configurations, but for rather dumb/cheap  
installs it makes a huge difference, since a balanced merge involves  
massive amounts of sequential reads and writes.

On Feb 21, 2007, at 9:56 AM, Nadav Har'El wrote:

> On Tue, Feb 20, 2007, Ning Li wrote about "Concurrent merge":
>> I think it's possible for another version of IndexWriter to have
>> a concurrent merge thread so that disk segments could be merged
>> while documents are being added or deleted.
>> This would be beneficial not only because it will improve indexing
>> performance when there are enough system resources, but more
>> importantly, disk segment merges will no longer block document
>> additions or deletions.
> Hi Ning,
> Reading what you wrote here, I understand that the most important  
> benefit
> (as you see it) of this idea is to make the document addition time  
> more
> predictable: the current situation where addDocument() usually takes a
> second, but sometimes can take an hour, will be gone.
> I am wondering, however, whether this problem will actually be  
> solved by
> the single-background-merge-thread solution you propose. I'll  
> explain the
> possible problem I see below.
>> ...
>> Two limits on the total number of disk segments are used to detect
>> merge's lag and to slow down document additions/deletions: a soft
>> limit and a hard limit. When the number of disk segments reaches
>> the soft limit, a document addition/deletion will be slowed down
>> for time T. As the number of disk segments continues to grow, the
>> time for which an addition/deletion is slowed down will increase.
>> When the number of disk segments reaches the hard limit, document
>> additions/deletions will be blocked until the number falls under
>> the hard limit. The hard limit is almost never reached with proper
>> slow down.
>> ..
> How many merges can go on concurrently in your scheme? You  
> mentioned "a merge
> thread", which leads me to think that you mean that only one merge  
> can go on
> at once. You also said you prefer a single merge thread in the end  
> of your
> email.
> However, I think that allowing only one merge to proceed at once  
> can prevent
> you from achieving your intended improvment (of never having to  
> wait an hour
> for adding a single document). Consider the following scenario:
> Imagine that after having just added the milionth document, a huge  
> merge
> starts in the background, which is going to take an hour.
> At the same time, new documents continue to come in and be added to  
> the index.
> When you add more documents, you start getting small segments,  
> which need to
> be merged into larger segments, and so on. But if you don't allow  
> multiple
> merges to be active concurrently, the huge merge in progress will  
> cause these
> small merges to be postponed for later, and in just a few minutes  
> you'll
> accumulate many small segments, and very soon your "slowdown"  
> mechanism will
> start slowing down the additions, to the point that after a few  
> minutes
> additions will have to come to a grinding halt to avoid accumulating
> thousands of new small segments.
> This means that if you allow only one concurrent merge, your original
> intention - of allowing additions to continue in (almost) full steam
> while a huge merge is in progress - may not be realized.
> A counter-argument may be that if you slow down the additions  
> enough - say -
> wait 10 seconds between every addition - you'll not accumulate too  
> many new
> segments in this hour, and at the same time never wait more than 10  
> seconds
> for an addition, so everything is perfect. Well, this view is only  
> true if
> additions do come only once every 10 seconds. If a new file is  
> queued to be
> added every, say, once every 2 seconds, then your deliberate  
> slowdown will
> cause new additions to be significantly delayed - up to almost the  
> full hour!
> Moreover, the 10 seconds you spend sleeping (doing nothing) could  
> have been
> used for something more productive on a multiple-CPU machine - such as
> continuing to do additions and do more merging.
> One possibility of solving this problem is to allow multiple  
> background
> merge threads, not just one.
> Another possible solution I can think of is to use the background  
> thread for
> merging when it's free, but do merges in the foreground (blocking  
> the adds)
> when it is busy. This means slowing down the adds a bit, but not  
> with sleep()
> calls which doesn't utilize the available CPU, but with real work,  
> and adds
> will continue to proceed at the fastest possible speed (depending  
> on the
> machine). Presumably, the really huge merges (which I guess are the  
> ones
> that bother you) will happen in the background thread.
>> Other ideas are most welcome!
> An idea I once had, but I have no idea how much it can help, is this:
> It is conceivable (but I didn't actually test this) that a  
> significant (?)
> percentage of indexing time is IO time - disk write, read and seek.
> It might be possible, while we're doing a merge of a few segments  
> on one
> disk drive, to write new segments, and do new merges, on a second disk
> drive; This means that all the writing we do to the second disk  
> drive doesn't
> disturb the work that is being done on the first drive (less head  
> seeks,
> and more concurrent reading and writing on both disks).
> But as I said, I never tested how much this idea can improve  
> performance
> on systems with multiple separate disks and multiple CPUs.
>> size (buffered documents) is not too big. And multiple disk merge
>> threads require significant system resources to add benefit.
> See my comments above on why multiple concurrent merges might be  
> necessary,
> depending on what the "benefit" you were aiming at.
> Thanks,
> Nadav.
> -- 
> Nadav Har'El                        |      Wednesday, Feb 21 2007,  
> 3 Adar 5767
> IBM Haifa Research Lab               
> |-----------------------------------------
>                                     |It's fortunate I have bad luck  
> - without
>           |it I would have no luck at all!
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message