lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Code Freeze on realtime_search branch
Date Fri, 29 Apr 2011 19:14:41 GMT
Sorry, but, no :)

So feel free to keep working towards removing this limitation!!

This change makes IndexWriter's flush (where it writes the added
documents in RAM to disk as a new segment) fully concurrent, so that
while one segment is being flushed (which could take a longish time,
eg on a slowish IO system), other threads are now free to continue
indexing (where they were blocked before).  On computers with
substantial CPU concurrency, and fast "enough" IO systems, this change
should give a big increase in indexing throughput.
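
To make the idea concrete, here is a minimal toy sketch (not Lucene code) of why swapping in a fresh buffer lets indexing continue while an older buffer is written out. SwappingBuffer, its methods, and the 50 ms sleep standing in for slow IO are all hypothetical names and values chosen for illustration; the real DWPT machinery is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: SwappingBuffer is a hypothetical class, not part of Lucene.
final class SwappingBuffer {
    private List<String> active = new ArrayList<>();               // docs buffered in RAM
    private final List<List<String>> segments = new ArrayList<>(); // "flushed" segments
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();

    // Indexing holds the lock only long enough to append to the active buffer.
    synchronized void add(String doc) {
        active.add(doc);
    }

    // Swap in a fresh buffer under the lock, then write the old one in the background.
    void flush() {
        final List<String> toFlush;
        synchronized (this) {
            toFlush = active;
            active = new ArrayList<>();
        }
        flusher.execute(() -> {
            try {
                Thread.sleep(50); // stand-in for slow IO; add() is not blocked meanwhile
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            synchronized (segments) {
                segments.add(toFlush);
            }
        });
    }

    // Number of documents currently buffered in the active (unflushed) buffer.
    synchronized int buffered() {
        return active.size();
    }

    // Wait for pending flushes and return the number of segments written.
    int close() {
        flusher.shutdown();
        try {
            flusher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        synchronized (segments) {
            return segments.size();
        }
    }
}
```

In this sketch a call to add() made right after flush() returns immediately and lands in the fresh buffer, even though the old buffer is still being "written".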

That said, I do think this change is a step towards what you seek
(allowing multiple IndexWriters, even in separate JVMs maybe on
separate computers, to write into an index at once).

Mike

http://blog.mikemccandless.com

On Fri, Apr 29, 2011 at 2:16 PM, Sanne Grinovero
<sanne.grinovero@gmail.com> wrote:
> Hello,
> this is totally awesome!
>
> Does it imply we don't need the IndexWriter lock anymore? And hence
> that people sharing the Lucene Directory across multiple JVMs can have
> both write at the same time?
>
> I had intentions to *try* removing such limitations this summer, but
> if this is the case I will spend my time testing this carefully
> instead, or if some kind of locking is still required I'd appreciate
> some pointers so that I'll be able to remove them.
>
> Regards,
> Sanne
>
> 2011/4/29 Simon Willnauer <simon.willnauer@googlemail.com>:
>> Hey folks,
>>
>> LUCENE-3023 aims to land the considerably large
>> DocumentsWriterPerThread (DWPT) refactoring on trunk.
>> During the last weeks we have put a lot of effort into cleaning the
>> code up, fixing javadocs and running tests locally
>> as well as on Jenkins. We reached the point where we are able to
>> create a final patch for review and land this
>> exciting refactoring on trunk very soon. I committed the CHANGES.TXT
>> entry (also appended below) a couple of minutes ago so from now on
>> we freeze the branch for final review (Robert can you create a new
>> "final" patch and upload to LUCENE-3023).
>> Any comments should go to [1] or as a reply to this email. If there is
>> no blocker coming up we plan to reintegrate the
>> branch and commit it to trunk early next week. For those who want some
>> background on what DWPT does, read [2].
>>
>> Note: this change will not change the index file format so there is no
>> need to reindex for trunk users. Yet, I will send a heads up next week
>> with an overview of what has changed.
>>
>> Simon
>>
>> [1] https://issues.apache.org/jira/browse/LUCENE-3023
>> [2] http://blog.jteam.nl/2011/04/01/gimme-all-resources-you-have-i-can-use-them/
>>
>>
>> * LUCENE-2956, LUCENE-2573, LUCENE-2324, LUCENE-2555: Changes from
>>  DocumentsWriterPerThread:
>>
>>  - IndexWriter now uses a DocumentsWriter per thread when indexing documents.
>>    Each DocumentsWriterPerThread indexes documents in its own private segment,
>>    and the in memory segments are no longer merged on flush.  Instead, each
>>    segment is separately flushed to disk and subsequently merged with normal
>>    segment merging.
>>
>>  - DocumentsWriterPerThread (DWPT) is now flushed concurrently based on a
>>    FlushPolicy.  When a DWPT is flushed, a fresh DWPT is swapped in so that
>>    indexing may continue concurrently with flushing.  The selected
>>    DWPT flushes all its RAM resident documents to disk.  Note: Segment flushes
>>    don't flush all RAM resident documents but only the documents private to
>>    the DWPT selected for flushing.
>>
>>  - Flushing is now controlled by FlushPolicy that is called for every add,
>>    update or delete on IndexWriter. By default DWPTs are flushed either on
>>    maxBufferedDocs per DWPT or the global active used memory. Once the active
>>    memory exceeds ramBufferSizeMB only the largest DWPT is selected for
>>    flushing and the memory used by this DWPT is subtracted from the active
>>    memory and added to a flushing memory pool, which can lead to temporarily
>>    higher memory usage due to ongoing indexing.
>>
>>  - IndexWriter now can utilize ramBufferSize > 2048 MB. Each DWPT can address
>>    up to 2048 MB memory such that the ramBufferSize is now bounded by the max
>>    number of DWPTs available in the used DocumentsWriterPerThreadPool.
>>    IndexWriter's net memory consumption can grow far beyond the 2048 MB limit if
>>    the application can use all available DWPTs. To prevent a DWPT from
>>    exhausting its address space IndexWriter will forcefully flush a DWPT if its
>>    hard memory limit is exceeded. The RAMPerThreadHardLimitMB can be controlled
>>    via IndexWriterConfig and defaults to 1945 MB.
>>    Since IndexWriter flushes DWPTs concurrently, not all memory is released
>>    immediately. Applications should still use a ramBufferSize significantly
>>    lower than the JVM's available heap memory since under high load multiple
>>    flushing DWPTs can consume substantial transient memory when IO performance
>>    is slow relative to indexing rate.
>>
>>  - IndexWriter#commit now doesn't block concurrent indexing while flushing all
>>    'currently' RAM resident documents to disk. Yet, flushes that occur while
>>    a full flush is running are queued and will happen after all DWPTs involved
>>    in the full flush are done flushing. Applications that use multiple threads
>>    during indexing and trigger a full flush (e.g. call commit() or open a new
>>    NRT reader) can use significantly more transient memory.
>>
>>  - IndexWriter#addDocument and IndexWriter#updateDocument can block indexing
>>    threads if the number of active plus the number of flushing DWPTs exceeds a
>>    safety limit. By default this happens if 2 * max number available thread
>>    states (DWPTPool) is exceeded. This safety limit prevents applications from
>>    exhausting their available memory if flushing can't keep up with
>>    concurrently indexing threads.
>>
>>  - IndexWriter only applies and flushes deletes if the maxBufferedDelTerms
>>    limit is reached during indexing. No segment flushes will be triggered
>>    due to this setting.
>>
>>  - IndexWriter#flush(boolean, boolean) doesn't synchronize on IndexWriter
>>    anymore. A dedicated flushLock has been introduced to prevent multiple
>>    full flushes happening concurrently.
>>
>>  - DocumentsWriter doesn't write shared doc stores anymore.
>>
>>  (Mike McCandless, Michael Busch, Simon Willnauer)
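
The memory-based flush trigger described in the entry above can be sketched roughly as follows. This is a toy accounting model under stated assumptions, not Lucene's FlushPolicy: the FlushAccounting and Dwpt classes and all their members are hypothetical, sizes are plain byte counts supplied by the caller, and the maxBufferedDocs trigger is left out. It only shows the bookkeeping: once active memory exceeds the RAM buffer, the largest DWPT is picked and its bytes move from the active pool to the flushing pool until its flush completes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy accounting model; class and field names are hypothetical, not Lucene's.
final class FlushAccounting {
    static final class Dwpt {
        final String name;
        final long bytesUsed;
        Dwpt(String name, long bytesUsed) {
            this.name = name;
            this.bytesUsed = bytesUsed;
        }
    }

    final long ramBufferBytes;                   // plays the role of ramBufferSizeMB
    final List<Dwpt> active = new ArrayList<>(); // DWPTs still accepting documents
    long activeBytes;                            // memory held by active DWPTs
    long flushingBytes;                          // memory held by DWPTs being flushed

    FlushAccounting(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    void register(Dwpt d) {
        active.add(d);
        activeBytes += d.bytesUsed;
    }

    // Called after each change: returns the DWPT selected for flushing, or null.
    Dwpt onChange() {
        if (activeBytes <= ramBufferBytes || active.isEmpty()) {
            return null;
        }
        Dwpt largest = active.get(0);
        for (Dwpt d : active) {
            if (d.bytesUsed > largest.bytesUsed) {
                largest = d;
            }
        }
        active.remove(largest);
        activeBytes -= largest.bytesUsed;   // subtracted from active memory
        flushingBytes += largest.bytesUsed; // added to the flushing pool
        return largest;
    }

    // Called when the flush has finished and the memory is actually released.
    void flushDone(Dwpt d) {
        flushingBytes -= d.bytesUsed;
    }
}
```

Note that activeBytes + flushingBytes can exceed ramBufferBytes between onChange() and flushDone(), which mirrors the temporarily higher memory usage mentioned in the entry.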
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
