lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 34930] - IndexWriter.maybeMergeSegments() takes lots of CPU resources
Date Mon, 16 May 2005 22:53:12 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=34930>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=34930

------- Additional Comments From psmith@apache.org  2005-05-17 00:53 -------
>> Your benchmark might run faster if you set maxBufferedDocs smaller.  Also, it
>> doesn't look like you're including the cost of closing the IndexWriter in your
>> benchmark statistics.  You should, as, with such a large buffer, you've delayed
>> much of the work to that point.
>> 

Yes, by not factoring the optimize()/close() call into the rate calculation,
there is still 'work to be done' at the end, but that would only be the tail
end of the remaining docs stored in memory, right?  When indexing millions of
records this is probably not going to be a large percentage of the overall
time, as it would only be, at most, the last maxBufferedDocs documents being
tidied up (e.g. with maxBufferedDocs = 10,000 and a million records, at most
1% of the total).  Or have I confused myself?  I'm still quite new to Lucene
and its inner workings.

>> The bottleneck you're hitting is that maybeMergeDocs sums the size of the
>> buffered indexes each time to decide whether to merge.  When you have thousands
>> buffered, this dominates.

Yes, when maxBufferedDocs is relatively high (which is useful, as I understand
it, when you have memory to throw at the application and want to hold off IO
for as long as possible), the scanning work over the buffered segments ends up
something like this, where N is maxBufferedDocs:

N + (N-1) + (N-2) + ... + 1

which sums to N(N+1)/2, so it grows quadratically with N.  (I'm no math whiz,
sorry.)

You can see this when indexing while outputting logging information: the
'rate' slows down gradually as documents are added to the in-memory buffer,
then once the automatic merge is performed the rate speeds up, then
progressively slows down again.
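
To make the arithmetic concrete, here is a tiny standalone toy (not Lucene
code, just a model of the pattern described above) that counts how many
segment-size lookups the repeated scan performs:

import java.util.ArrayList;
import java.util.List;

public class MergeScanCost {
    public static void main(String[] args) {
        int maxBufferedDocs = 10000;  // plays the role of IndexWriter's buffer setting
        List<Integer> bufferedSegments = new ArrayList<Integer>();
        long scanSteps = 0;
        for (int doc = 0; doc < maxBufferedDocs; doc++) {
            bufferedSegments.add(1);  // addDocument: one new single-doc segment in RAM
            // maybeMergeSegments: re-sum the sizes of every buffered segment
            int mergeDocs = 0;
            for (int docCount : bufferedSegments) {
                mergeDocs += docCount;
                scanSteps++;
            }
        }
        System.out.println("scan steps = " + scanSteps);  // prints 50005000 = N(N+1)/2
    }
}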

>> 
>> To optimize this case (small docs, large maxBufferedDocs) we could keep count of
>> the number of documents buffered by adding a bufferedDocCount field. 
>> addDocument could increment this, mergeSegments could decrement it, and
>> maybeMergeSegments could check it with something like:
>> 
>> if (targetMergeDocs == minMergeDocs)  {
>>   mergeDocs = bufferedDocCount;
>> } else {
>>   while (--minSegment >= 0) {
>>   ...
>>   }
>> }
>> 
>> Does that make sense?

Err, to be honest I'm not quite sure what you mean by "small docs" in your
first statement above.  I'm also a little confused by the:

>> if (targetMergeDocs == minMergeDocs)  {

and how it relates to the bufferedDocCount you mention.

In my hack/patch I'm effectively keeping track of the number of documents added
as you suggest, so I believe we're pretty close to the same thing, but I blame
having had only one coffee for not quite following it.  :)  I think I like
where you're going though; it smells right.
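
For what it's worth, here is a rough standalone sketch of how I read the
counter idea (bufferedDocCount and minMergeDocs are the names from your
comment; everything else is made up for illustration and is not the actual
IndexWriter code):

public class BufferedDocCounter {
    private int bufferedDocCount = 0;  // incremented on add, reset on merge
    private final int minMergeDocs;    // merge threshold for buffered docs

    BufferedDocCounter(int minMergeDocs) {
        this.minMergeDocs = minMergeDocs;
    }

    void addDocument() {               // stand-in for IndexWriter.addDocument
        bufferedDocCount++;            // O(1) bookkeeping instead of a re-scan
        maybeMergeSegments();
    }

    private void maybeMergeSegments() {
        // At the lowest merge level the buffered-doc total is already known,
        // so no per-segment scan is needed; larger levels would still scan.
        if (bufferedDocCount >= minMergeDocs) {
            mergeSegments();
        }
    }

    private void mergeSegments() {
        System.out.println("merging " + bufferedDocCount + " buffered docs");
        bufferedDocCount = 0;          // the in-memory segments were flushed
    }

    public static void main(String[] args) {
        BufferedDocCounter writer = new BufferedDocCounter(3);
        for (int i = 0; i < 7; i++) {
            writer.addDocument();      // merges after the 3rd and 6th adds
        }
    }
}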

Something I thought about last night: the current code works fine for all
cases, but the 'clean index being rebuilt' case seems to get the short end of
the stick.  When an index is being incrementally or batch updated, IndexWriter
probably does need to scan the segments to see whether they need merging to
obey the configuration settings.  However, a fresh index is a special case,
and it seems like it could be optimized (this may be what you meant by 'small
docs'?).

In really large-scale indexes/indices(?), the number of documents being
incrementally/batch updated is going to be minor compared to the cost of
building the full index for the first time.  For a true High Availability
solution, one will always need to factor in the cost of rebuilding the index
from scratch should it come to that.  Making the initial indexing as fast as
possible means much less downtime.

For crawling applications that use Lucene, this optimization will probably not
even be noticed, because of the latency in retrieving the source material that
makes up each document.  Only when the source material can be accessed quickly
will this optimization matter.

Removing this CPU bottleneck has even more benefit for solutions that use
Lucene to index documents in parallel.  The gain multiplies with the number of
CPUs being utilized, particularly if the architecture is a producer/consumer
pipeline with a buffer in between (though obviously IO starts to get in the
way more).  With the optimization in place, the CPUs can be better utilized
performing tokenization etc.
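
As a rough illustration of the producer/consumer shape I mean (the queue size,
document count, and the 'analysis' placeholder are all arbitrary):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    private static final String POISON = "<end-of-stream>";  // shutdown marker

    public static void main(String[] args) throws InterruptedException {
        // Bounded buffer between the fetching side and the indexing side.
        final BlockingQueue<String> buffer = new ArrayBlockingQueue<String>(100);

        // Producer: stands in for crawling/fetching the source material.
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 1000; i++) {
                        buffer.put("document body " + i);  // blocks when the buffer is full
                    }
                    buffer.put(POISON);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        // Consumer: stands in for tokenizing and calling writer.addDocument.
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    String doc;
                    while (!(doc = buffer.take()).equals(POISON)) {
                        doc.toLowerCase();  // placeholder for analysis/tokenization work
                        // writer.addDocument(...) would go here; the cheaper
                        // maybeMergeSegments is, the more of this thread's CPU
                        // goes to tokenization instead of segment scanning.
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}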

cheers,

Paul

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

