lucene-java-user mailing list archives

From Daniel Penning <dpenn...@gamona.de>
Subject Re: Lucene multithreaded indexing problems
Date Sat, 23 Nov 2013 16:28:50 GMT
G1 and CMS are both tuned primarily for low pause times, which is typically
preferred when searching an index. In this case I guess that indexing
throughput matters more, in which case ParallelGC might be the
better choice.
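
To check which collector a JVM is actually running with, and how much time it has spent collecting, something like the following may help. It is only a sketch using the standard java.lang.management API; the class name GcReport and the heap size in the comment are illustrative assumptions, not values taken from this thread.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class GcReport {
    /** Returns one line per active collector: name, collection count, accumulated time. */
    public static String describeCollectors() {
        StringBuilder sb = new StringBuilder();
        List<GarbageCollectorMXBean> beans = ManagementFactory.getGarbageCollectorMXBeans();
        for (GarbageCollectorMXBean gc : beans) {
            sb.append(gc.getName())
              .append(": collections=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example launch (heap size is an arbitrary illustration):
        //   java -XX:+UseParallelGC -Xmx4g GcReport
        System.out.print(describeCollectors());
    }
}
```

Calling describeCollectors() periodically from the indexing application and comparing the accumulated times between runs with different collectors gives a rough idea of how much of the wall-clock time goes to GC.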

On 23.11.2013 17:15, Uwe Schindler wrote:
> Hi,
>
> Maybe your heap size is just too big, so your JVM spends too much time in GC? The setup
> you described in your last email is the "officially supported" setup :-) Lucene has no problem
> with that setup and can index. Just make sure of the following:
> - Don't give too much heap to your indexing app. Larger heaps create much more GC load.
> - Use a suitable garbage collector (e.g. the Java 7 G1 collector or the Java 6 CMS collector).
>   Other garbage collectors may do GCs in a single thread ("stop-the-world").
>
> Uwe
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Igor Shalyminov [mailto:ishalyminov@yandex-team.ru]
>> Sent: Saturday, November 23, 2013 4:46 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene multithreaded indexing problems
>>
>> So we return to the initially described setup: multiple parallel workers, each
>> doing "parse + indexWriter.addDocument()" for single documents with no
>> synchronization on my side. This setup was also bad in terms of memory consumption
>> and thread blocking, as I reported.
>>
>> Or did I misunderstand you?
>>
>> --
>> Igor
>>
>> 22.11.2013, 23:34, "Uwe Schindler" <uwe@thetaphi.de>:
>>> Hi,
>>> Don't use addDocuments. This method is intended for so-called block
>>> indexing (where all documents need to be in one block for block joins). Call
>>> addDocument for each document, possibly from many threads. That way
>>> Lucene can better handle multithreading and free memory early. There is
>>> really no need to use bulk adds; they are solely for block joins, where docs need
>>> to be sequential and without gaps.
>>> Uwe
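
The pattern recommended above (one addDocument call per document, issued from many threads, with no extra locking) could be sketched roughly as follows. This is not code from the thread and does not use Lucene: DocSink is a hypothetical stand-in for the shared IndexWriter, and parse() is a hypothetical placeholder for the real parsing step, so that the sketch runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexingSketch {
    /** Hypothetical stand-in for the shared writer (the role IndexWriter plays in Lucene). */
    public interface DocSink {
        void addDocument(String parsedDoc) throws Exception;
    }

    /** Hypothetical parse step; a real one would build a Lucene Document. */
    public static String parse(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    /** Parses and adds every input on a fixed pool; returns the number of tasks completed. */
    public static int indexAll(List<String> rawDocs, DocSink sink, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String raw : rawDocs) {
                // Each task parses one document and adds it directly,
                // with no synchronization around the shared sink.
                pending.add(pool.submit(() -> {
                    sink.addDocument(parse(raw));
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // rethrows any worker failure as ExecutionException
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger added = new AtomicInteger();
        DocSink countingSink = doc -> added.incrementAndGet();
        int n = indexAll(Arrays.asList(" Foo ", "BAR", "baz"), countingSink, 2);
        System.out.println(n + " submitted, " + added.get() + " added");
    }
}
```

In a real indexer the sink would be the single shared IndexWriter, and each task would drop its reference to the parsed document as soon as the add returns.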
>>>
>>> Igor Shalyminov <ishalyminov@yandex-team.ru> wrote:
>>>
>>>> - uwe@
>>>>
>>>> Thanks Uwe!
>>>>
>>>> I changed the logic so that my workers only parse input docs into
>>>> Documents, and indexWriter does addDocuments() by itself for the
>>>> chunks of 100 Documents.
>>>> Unfortunately, the behaviour still reproduces: memory usage slowly
>>>> increases with the number of processed documents, and at some point
>>>> the program runs very slowly, and it seems that only a single thread
>>>> is active.
>>>> It happens after lots of parse/index cycles.
>>>>
>>>> The current instance is now in the "single-thread" phase with ~100%
>>>> CPU and with 8397M RES memory (limit for the VM is -Xmx8G).
>>>> My question is, when does addDocuments() release all resources passed
>>>> in (the Documents themselves)?
>>>> Are the resources released after the function call finishes, or do I
>>>> have to call indexWriter.commit() after, say, each chunk?
>>>>
>>>> --
>>>> Igor
>>>>
>>>> 21.11.2013, 19:59, "Uwe Schindler" <uwe@thetaphi.de>:
>>>>>   Hi,
>>>>>
>>>>>   why are you doing this? Lucene's IndexWriter can handle addDocuments
>>>>>   in multiple threads. And, since Lucene 4, it will process them almost
>>>>>   completely in parallel!
>>>>>   If you do the addDocuments single-threaded, you are adding an
>>>>>   additional bottleneck to your application. If you are synchronizing
>>>>>   on IndexWriter (which I hope you will not do), things
>>>>>   will go wrong, too.
>>>>>   Uwe
>>>>>
>>>>>   -----
>>>>>   Uwe Schindler
>>>>>   H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>   http://www.thetaphi.de
>>>>>   eMail: uwe@thetaphi.de
>>>>>>    -----Original Message-----
>>>>>>    From: Igor Shalyminov [mailto:ishalyminov@yandex-team.ru]
>>>>>>    Sent: Thursday, November 21, 2013 4:45 PM
>>>>>>    To: java-user@lucene.apache.org
>>>>>>    Subject: Lucene multithreaded indexing problems
>>>>>>
>>>>>>    Hello!
>>>>>>
>>>>>>    I tried to do multithreaded indexing with a FixedThreadPool of
>>>>>>    Callable workers.
>>>>>>    The main operation - parsing a single document and addDocument() to the
>>>>>>    index - is done by a single worker.
>>>>>>    Parsing a document produces a lot (really a lot) of Strings, and at the
>>>>>>    end of the worker's call() all of them go to the indexWriter.
>>>>>>    I use no merging; the resources are flushed to disk when the segment size
>>>>>>    limit is reached.
>>>>>>
>>>>>>    The problem is that after a little while (when most of the heap memory is
>>>>>>    used) the indexer makes no progress, and CPU load is a constant 100% (no
>>>>>>    difference whether there are 2 threads or 32). So I think at some point garbage
>>>>>>    collection takes the whole indexing process down.
>>>>>>
>>>>>>    Could you please give some advice on proper concurrent indexing with
>>>>>>    Lucene?
>>>>>>    Can there be "memory leaks" somewhere in the indexWriter? Maybe I must
>>>>>>    perform some operations on the writer to release unused resources from time
>>>>>>    to time?
>>>>>>
>>>>>>    --
>>>>>>    Best Regards,
>>>>>>    Igor
>>>> ---------------------------------------------------------------------
>>>>>>    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>    For additional commands, e-mail: java-user-help@lucene.apache.org
>>> --
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, 28213 Bremen
>>> http://www.thetaphi.de
>
>
>

