lucene-java-user mailing list archives

From Igor Shalyminov <ishalymi...@yandex-team.ru>
Subject Re: Lucene multithreaded indexing problems
Date Mon, 25 Nov 2013 15:19:59 GMT
Thank you!

But here's what I have.

Today I looked at the indexer in VisualVM, and I can definitely say that the problem is
in the memory: the resources (which are mostly Document fields) just don't go away.
I tried different GCs (Parallel, CMS, the default one), and every time the behaviour is the
same.
Since I pass my Documents into the indexWriter and then forget about them (the references are all local-scope),
I think the resources are stuck somewhere in the writer.

I now wonder how I can see:
- how many threads are used by the indexWriter?
- when does it flush segments to disk?

Can I also find out when the indexWriter is done with my Document? Is the addDocument()
operation synchronous?
Do I need to call commit() frequently (I also need to keep the segment size constant and use no
merging)?
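
(Not part of the original thread - one way to observe flushes is Lucene's InfoStream diagnostics, and the RAM buffer size controls when segments hit disk. A sketch assuming Lucene 4.x; the index path and analyzer choice are placeholders:)

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.PrintStreamInfoStream;
import org.apache.lucene.util.Version;

public class FlushLogging {
    public static void main(String[] args) throws IOException {
        Directory dir = FSDirectory.open(new File("/tmp/index")); // placeholder path
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
        // Print flush/merge diagnostics, including which indexing threads flush which segments.
        iwc.setInfoStream(new PrintStreamInfoStream(System.out));
        // Flush by RAM usage: a smaller buffer means earlier, more frequent flushes.
        iwc.setRAMBufferSizeMB(64.0);
        IndexWriter writer = new IndexWriter(dir, iwc);
        // ... addDocument() calls from worker threads ...
        writer.close();
    }
}
```

The InfoStream output answers both questions above: it logs per-thread flush activity as it happens.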

-- 
Igor

23.11.2013, 20:29, "Daniel Penning" <dpenning@gamona.de>:
> G1 and CMS are both tuned primarily for low pauses, which is typically
> preferred for searching an index. In this case I guess that indexing
> throughput is preferred, in which case using ParallelGC might be the
> better choice.
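>
(Not from the original thread - the collector choices above map to JVM flags roughly as follows; `indexer.jar` stands in for your indexing app:)

```shell
# Throughput-oriented, multi-threaded "parallel" collector, good for batch indexing:
java -Xmx4g -XX:+UseParallelGC -jar indexer.jar

# Low-pause collectors, typically favoured on the search side:
java -Xmx4g -XX:+UseG1GC -jar indexer.jar              # Java 7+
java -Xmx4g -XX:+UseConcMarkSweepGC -jar indexer.jar   # Java 6 CMS
```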
>
> Am 23.11.2013 17:15, schrieb Uwe Schindler:
>
>>  Hi,
>>
>>  Maybe your heap size is just too big, so your JVM spends too much time in GC? The
>>  setup you described in your last eMail is the "officially supported" setup :-) Lucene has
>>  no problem with that setup and can index. Be sure:
>>  - Don't give too much heap to your indexing app. Larger heaps create much more
>>  GC load.
>>  - Use a suitable garbage collector (e.g. the Java 7 G1 collector or the Java 6 CMS collector).
>>  Other garbage collectors may do GCs in a single thread ("stop-the-world").
>>
>>  Uwe
>>  -----
>>  Uwe Schindler
>>  H.-H.-Meier-Allee 63, D-28213 Bremen
>>  http://www.thetaphi.de
>>  eMail: uwe@thetaphi.de
>>>  -----Original Message-----
>>>  From: Igor Shalyminov [mailto:ishalyminov@yandex-team.ru]
>>>  Sent: Saturday, November 23, 2013 4:46 PM
>>>  To: java-user@lucene.apache.org
>>>  Subject: Re: Lucene multithreaded indexing problems
>>>
>>>  So we return to the initially described setup: multiple parallel workers, each
>>>  performing "parse + indexWriter.addDocument()" for single documents, with no
>>>  synchronization on my side. This setup also had the memory-consumption and
>>>  thread-blocking problems I reported.
>>>
>>>  Or did I misunderstand you?
>>>
>>>  --
>>>  Igor
>>>
>>>  22.11.2013, 23:34, "Uwe Schindler" <uwe@thetaphi.de>:
>>>>  Hi,
>>>>  Don't use addDocuments. This method is mainly meant for so-called block
>>>>  indexing (where all documents need to be in one block for block joins). Call
>>>>  addDocument for each document, possibly from many threads. That way
>>>>  Lucene can better handle multithreading and free memory early. There is
>>>>  really no need to use bulk adds; they are solely for block joins, where docs need
>>>>  to be sequential and without gaps.
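
(An illustration, not from the original mail: per-document addDocument calls from a fixed thread pool, with no external synchronization - IndexWriter is thread-safe. The `parse` step and the raw input type are hypothetical placeholders:)

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class ParallelIndexer {
    // 'writer' is an already-configured IndexWriter; parse() is your own parsing step.
    static void indexAll(final IndexWriter writer, List<String> rawDocs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (final String raw : rawDocs) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Document doc = parse(raw);   // hypothetical parse step
                        writer.addDocument(doc);     // one document per call, many threads
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static Document parse(String raw) {
        // Build Fields from the raw input here.
        return new Document();
    }
}
```

Once a submitted task returns from addDocument(), the worker drops its only reference to the Document, so it becomes collectable as soon as the writer has consumed it.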
>>>>  Uwe
>>>>
>>>>  Igor Shalyminov <ishalyminov@yandex-team.ru> schrieb:
>>>>>  - uwe@
>>>>>
>>>>>  Thanks Uwe!
>>>>>
>>>>>  I changed the logic so that my workers only parse input docs into
>>>>>  Documents, and a single thread calls indexWriter.addDocuments() for
>>>>>  chunks of 100 Documents.
>>>>>  Unfortunately, the behaviour persists: memory usage gradually
>>>>>  increases with the number of processed documents, and at some point
>>>>>  the program runs very slowly, and it seems that only a single thread
>>>>>  is active.
>>>>>  It happens after lots of parse/index cycles.
>>>>>
>>>>>  The current instance is now in the "single-thread" phase with ~100%
>>>>>  CPU and 8397M RES memory (the limit for the VM is -Xmx8G).
>>>>>  My question is, when does addDocuments() release all the resources
>>>>>  passed in (the Documents themselves)?
>>>>>  Are the resources released when the call returns, or do I
>>>>>  have to call indexWriter.commit() after, say, each chunk?
>>>>>
>>>>>  --
>>>>>  Igor
>>>>>
>>>>>  21.11.2013, 19:59, "Uwe Schindler" <uwe@thetaphi.de>:
>>>>>>    Hi,
>>>>>>
>>>>>>    why are you doing this? Lucene's IndexWriter can handle
>>>>>>  addDocuments
>>>>>  in multiple threads. And, since Lucene 4, it will process them almost
>>>>>  completely in parallel!
>>>>>>    If you do the addDocuments single-threaded you are adding an
>>>>>  additional bottleneck in your application. If you are doing a
>>>>>  synchronization on IndexWriter (which I hope you will not do), things
>>>>>  will go wrong, too.
>>>>>>    Uwe
>>>>>>
>>>>>>    -----
>>>>>>    Uwe Schindler
>>>>>>    H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>    http://www.thetaphi.de
>>>>>>    eMail: uwe@thetaphi.de
>>>>>>>     -----Original Message-----
>>>>>>>     From: Igor Shalyminov [mailto:ishalyminov@yandex-team.ru]
>>>>>>>     Sent: Thursday, November 21, 2013 4:45 PM
>>>>>>>     To: java-user@lucene.apache.org
>>>>>>>     Subject: Lucene multithreaded indexing problems
>>>>>>>
>>>>>>>     Hello!
>>>>>>>
>>>>>>>     I tried to perform indexing in multiple threads, with a FixedThreadPool of
>>>>>>>     Callable workers.
>>>>>>>     The main operation - parsing a single document and addDocument() to the
>>>>>>>     index - is done by a single worker.
>>>>>>>     After parsing a document, a lot (really a lot) of Strings appear, and at the
>>>>>>>     end of the worker's call() all of them go to the indexWriter.
>>>>>>>     I use no merging; the resources are flushed to disk when the segment size
>>>>>>>     limit is reached.
>>>>>>>
>>>>>>>     The problem is, after a little while (when most of the heap memory is
>>>>>>>     used) the indexer makes no progress, and CPU load is a constant 100%
>>>>>>>     (no difference whether there are 2 threads or 32). So I think at some point
>>>>>>>     garbage collection takes the whole indexing process down.
>>>>>>>
>>>>>>>     Could you please give some advice on proper concurrent indexing with
>>>>>>>     Lucene?
>>>>>>>     Can there be "memory leaks" somewhere in the indexWriter? Maybe I must
>>>>>>>     perform some operations with the writer to release unused resources from
>>>>>>>     time to time?
>>>>>>>
>>>>>>>     --
>>>>>>>     Best Regards,
>>>>>>>     Igor
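
(Not from the thread - the "no merging, flush on segment size limit" setup described above could be configured roughly as below. A sketch assuming Lucene 4.x; the exact NoMergePolicy constant varies across versions, and `analyzer` is a placeholder:)

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.util.Version;

public class NoMergeConfig {
    static IndexWriterConfig build() {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45); // placeholder
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
        // Disable all background merging; segments stay exactly as flushed.
        iwc.setMergePolicy(NoMergePolicy.COMPOUND_FILES);
        // Flush a new segment once this many buffered documents accumulate,
        // which keeps segment sizes roughly constant.
        iwc.setMaxBufferedDocs(100000);
        return iwc;
    }
}
```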
>>>>  --
>>>>  Uwe Schindler
>>>>  H.-H.-Meier-Allee 63, 28213 Bremen
>>>>  http://www.thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

