lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: IndexWriter.applyDeletes performance
Date Fri, 05 Mar 2010 15:29:06 GMT
OK I opened:

  https://issues.apache.org/jira/browse/LUCENE-2297

Mike

On Fri, Mar 5, 2010 at 10:25 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> Currently you can't tell IW to use the pool (i.e., the pool is only enabled
> if you use NRT readers).  We should probably make this an option at
> ctor time, for situations like this.  (In fact, in follow-on
> discussions about further improvements to NRT we've already discussed
> having such an option in IW's ctors.)  I'll open an issue for this.
>
> Indeed from that profiler output it looks like most of the time is
> being spent opening the SegmentReaders (to do deletes), specifically
> loading the terms dict index (64% overall) and loading the deleted
> docs (10%).
>
> But... how long does step 2 take?  Is it an option to not commit on
> every update?  How many docs do you typically update?
>
> If you are committing only so that an outside reader can reopen, you
> should consider just using an NRT reader instead (assuming the reader
> is in the same JVM as the IndexWriter).
>
> Roughly how much more RAM consumption do you see when you force pooling?
>
> Mike
>
> On Fri, Mar 5, 2010 at 9:18 AM, Bogdan Ghidireac <bogdan@ecstend.com> wrote:
>> Hi,
>>
>> I have an index with 100 million docs that takes around 20GB on disk and
>> has an update rate of a few hundred docs per minute. The new docs are
>> grouped in batches and indexed once every few minutes. My problem is
>> that update performance has degraded too much over time as the index
>> has grown in size (number of distinct docs).
>>
>> My indexing flow looks like this:
>>
>> 0. create indexWriter (only once)
>> 1. get the open indexWriter
>> 2. for each doc call indexWriter.updateDocument(pkTerm, doc)
>> 3. indexWriter.commit
>> 4. indexWriter.waitForMerges
>> 5. wait for new docs and goto 1.
>>
>> I ran a profiler for several minutes and I noticed that most of the
>> time the indexer is busy applying the deletes. This takes so much time
>> because all terms are loaded for every commit (see the attached
>> profiler screenshot).
>>
>> The index writer has a pool of readers, but they are not used unless
>> near real time is enabled. I changed my code to force the pool to be
>> used, but the only way I can do this is to request a reader that is
>> never used, via writer.getReader(). Of course, the memory consumption is
>> higher now because I have the terms in memory, but steps 3+4 complete in
>> 1-2 secs compared to 8-10 secs.
>>
>> Is it possible to enable the readers pool at the IndexWriter
>> constructor level? My current method looks like a hack ...
>> I am using Lucene 2.9.2 on Linux.
>>
>> Regards,
>> Bogdan
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
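
To illustrate why pooling the readers helps in the thread above, here is a toy model in plain Java. It is only a sketch and does not use any Lucene classes: ReaderPoolSketch, FakeSegmentReader, and both methods are hypothetical names invented for this example, and the "load" counter merely stands in for the costly work the profiler showed (loading the terms dict index and the deleted docs). It also assumes the set of segments stays fixed, which a real index with merges would not guarantee.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these classes are Lucene classes.
class ReaderPoolSketch {

    /** Stand-in for a per-segment reader; counts how often the costly load runs. */
    static final class FakeSegmentReader {
        static int loads = 0;
        final String segment;

        FakeSegmentReader(String segment) {
            this.segment = segment;
            loads++; // models loading the terms dict index and deleted docs
        }
    }

    /** Without pooling: every commit opens a fresh reader per segment. */
    static int commitsWithoutPool(List<String> segments, int commits) {
        FakeSegmentReader.loads = 0;
        for (int c = 0; c < commits; c++) {
            for (String s : segments) {
                new FakeSegmentReader(s); // opened to apply deletes, then discarded
            }
        }
        return FakeSegmentReader.loads;
    }

    /** With pooling: readers stay open and are reused across commits. */
    static int commitsWithPool(List<String> segments, int commits) {
        FakeSegmentReader.loads = 0;
        Map<String, FakeSegmentReader> pool = new HashMap<String, FakeSegmentReader>();
        for (int c = 0; c < commits; c++) {
            for (String s : segments) {
                // The costly load happens only on the first request for a segment.
                pool.computeIfAbsent(s, FakeSegmentReader::new);
            }
        }
        return FakeSegmentReader.loads;
    }

    public static void main(String[] args) {
        List<String> segments = Arrays.asList("_0", "_1", "_2");
        System.out.println("loads without pool: " + commitsWithoutPool(segments, 10));
        System.out.println("loads with pool:    " + commitsWithPool(segments, 10));
    }
}
```

In this model the unpooled path pays the load cost once per segment per commit, while the pooled path pays it once per segment in total, at the price of keeping every reader (and its terms index) in memory. That is the same trade-off described in the thread: faster commits in exchange for higher RAM consumption.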
