lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: IndexWriter.applyDeletes performance
Date Fri, 05 Mar 2010 15:29:06 GMT
OK I opened:


On Fri, Mar 5, 2010 at 10:25 AM, Michael McCandless
<> wrote:
> Currently you can't tell IW to use the pool (ie, pool is only enabled
> if you use NRT readers).  We should probably make this an option at
> ctor time, for situations like this.  (In fact, in follow-on
> discussions about further improvements to NRT we've already discussed
> having such an option on IW's ctors.)  I'll open an issue for this.
> Indeed from that profiler output it looks like most of the time is
> being spent opening the SegmentReaders (to do deletes), specifically
> loading the terms dict index (64% overall) and loading the deleted
> docs (10%).
> But... how long does step 2 take?  Is it an option to not commit on
> every update?  How many docs do you typically update?
> If you are committing only so that an outside reader can reopen, you
> should consider just using an NRT reader instead (assuming the reader
> is in same JVM as IndexWriter).
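A minimal sketch of that NRT pattern (Lucene 2.9 API; `dir` and `analyzer` are placeholder variables, not from the thread):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

// Sketch: open an NRT reader from the writer instead of committing just so
// an external reader can reopen.  getReader() also enables the writer's
// internal SegmentReader pool.
IndexWriter writer = new IndexWriter(dir, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
IndexReader reader = writer.getReader();   // near-real-time reader

// ... after indexing a batch, refresh the view without a commit:
IndexReader newReader = reader.reopen();   // returns same instance if unchanged
if (newReader != reader) {
  reader.close();                          // release the old view
  reader = newReader;
}
```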
> Roughly how much more RAM consumption do you see when you force pooling?
> Mike
> On Fri, Mar 5, 2010 at 9:18 AM, Bogdan Ghidireac <> wrote:
>> Hi,
>> I have an index with 100 million docs that has around 20GB on disk and
>> an update rate of a few hundred docs per minute. The new docs are
>> grouped in batches and indexed once every few minutes. My problem is
>> that the update performance degraded too much over time as the index
>> increased in size (distinct docs).
>> My indexing flow looks like this ..
>> 0. create indexWriter (only once)
>> 1. get the open indexWriter
>> 2. for each doc call indexWriter.updateDocument(pkTerm, doc)
>> 3. indexWriter.commit
>> 4. indexWriter.waitForMerges
>> 5. wait for new docs and goto 1.
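The steps above correspond roughly to this (Lucene 2.9 API; `dir`, `analyzer`, and `fetchBatch()` are placeholders for the actual setup):

```java
IndexWriter writer = new IndexWriter(dir, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);          // step 0: create once
while (true) {
  for (Document doc : fetchBatch()) {               // steps 1-2
    // updateDocument = delete-by-term + add; pkTerm identifies the old version
    Term pkTerm = new Term("pk", doc.get("pk"));
    writer.updateDocument(pkTerm, doc);
  }
  writer.commit();                                  // step 3
  writer.waitForMerges();                           // step 4
  // step 5: block until the next batch is ready, then loop
}
```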
>> I ran a profiler for several minutes and I noticed that most of the
>> time the indexer is busy applying the deletes. This takes so much time
>> because all terms are loaded for every commit (see the attached
>> profiler screenshot).
>> The index writer has a pool of readers but they are not used unless
>> near real time is enabled. I changed my code to force the pool to be
>> used but the only way I can do this is to request a reader that is
>> never used: writer.getReader(). Of course, the memory consumption is
>> higher now because I have terms in memory but steps 3+4 complete in
>> 1-2 secs compared to 8-10 secs.
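The workaround amounts to this (a sketch; the returned reader is never actually searched):

```java
// Requesting an NRT reader flips on IndexWriter's internal reader pool as a
// side effect (Lucene 2.9); subsequent commits then reuse pooled
// SegmentReaders to apply deletes instead of reopening them each time.
IndexReader unused = writer.getReader();
```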
>> Is it possible to enable the reader pool at the IndexWriter
>> constructor level? My current method looks like a hack ...
>> I am using Lucene 2.9.2 on Linux.
>> Regards,
>> Bogdan
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

