lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Lucene multithreaded indexing problems
Date Mon, 25 Nov 2013 19:30:30 GMT
Hi,

> But here's what I have.
> 
> Today I looked at the indexer in VisualVM, and I can definitely say that
> the problem is in the memory: the resources (which are mostly Document
> fields) just don't go away.
> I tried different GCs (Parallel, CMS, the default one), and every time the
> behaviour is the same.
> As I pass my Documents into the indexWriter, I forget about them (the
> references are all local in scope), so I think the resources are stuck
> somewhere in the writer.

That is strange! Are you sure this is the case? Maybe you are using Readers in your Fields?

> I wonder now how do I see:
> - how many threads are used by the indexWriter?

The IndexWriter uses as many threads as you use for indexing, up to maxThreadStates. If you
index with more threads than that, addDocument() will block.
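For reference, the thread-state count is configurable on IndexWriterConfig. A minimal sketch, assuming a Lucene 4.x-era API (version constant and index path are illustrative):

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexerSetup {
  public static void main(String[] args) throws Exception {
    // Raise maxThreadStates so more indexing threads can proceed
    // concurrently without blocking in addDocument().
    IndexWriterConfig config = new IndexWriterConfig(
        Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45));
    config.setMaxThreadStates(16); // default is 8 in Lucene 4.x
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/tmp/index")), config); // path is illustrative
    // ... addDocument() from multiple worker threads ...
    writer.close();
  }
}
```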

> - when does it flush segments to disk?

This depends on several settings (e.g. the RAM buffer size and the maximum number of buffered
documents). But at that point the documents are no longer referenced! The work of analyzing a
document is done inside the addDocument() call; when the method returns, the document is no
longer in use.
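The flush knobs live on IndexWriterConfig. A sketch, assuming Lucene 4.x and an illustrative buffer size, that makes flushing depend only on RAM usage:

```java
import org.apache.lucene.index.IndexWriterConfig;

public class FlushSettings {
  // Flush a new segment whenever the in-memory buffer exceeds a RAM
  // threshold, instead of after a fixed document count.
  static void configureFlushing(IndexWriterConfig config) {
    config.setRAMBufferSizeMB(64.0); // ~64 MB per flushed segment; illustrative value
    config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH); // never flush by doc count
  }
}
```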

> Can I also know whether the indexWriter is done with my Document? Is
> the addDocument() operation synchronous?

When addDocument() returns, the Document is no longer referenced. Ideally you can reuse the
Document/Field instances!
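Reuse can look like the following sketch; the `records` iterable and the field names are hypothetical stand-ins for your own parsing output:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ReusingIndexerWorker {
  // One Document and its Fields are allocated once per worker thread and
  // refilled for every record. This is safe because addDocument() fully
  // consumes the document before it returns.
  static void indexAll(IndexWriter writer, Iterable<String[]> records) throws Exception {
    Document doc = new Document();
    Field idField = new StringField("id", "", Field.Store.YES);
    Field bodyField = new TextField("body", "", Field.Store.NO);
    doc.add(idField);
    doc.add(bodyField);
    for (String[] rec : records) {    // rec[0] = id, rec[1] = text (illustrative layout)
      idField.setStringValue(rec[0]);
      bodyField.setStringValue(rec[1]);
      writer.addDocument(doc);
    }
  }
}
```

Reusing instances avoids re-allocating thousands of short-lived Field objects per document, which directly reduces the GC pressure discussed below.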

> Do I need to call commit() frequently (I also need to keep segment size
> constant and use no merging)?

Write your own MergePolicy to control how merging is done.
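Since you want no merging at all, the simplest option is the built-in NoMergePolicy. A sketch assuming a Lucene 4.x-era API (later versions renamed the singleton to NoMergePolicy.INSTANCE):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.util.Version;

public class NoMergeSetup {
  static IndexWriterConfig noMergeConfig() {
    IndexWriterConfig config = new IndexWriterConfig(
        Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45));
    // Disable background merges entirely: segments are written once at
    // flush time and never combined, so segment size is bounded by the
    // flush settings alone.
    config.setMergePolicy(NoMergePolicy.COMPOUND_FILES);
    return config;
  }
}
```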

> 
> --
> Igor
> 
> 23.11.2013, 20:29, "Daniel Penning" <dpenning@gamona.de>:
> > G1 and CMS are both tuned primarily for low pauses, which is typically
> > preferred for searching an index. In this case I guess that indexing
> > throughput is preferred, in which case using ParallelGC might be the
> > better choice.
> >
> > Am 23.11.2013 17:15, schrieb Uwe Schindler:
> >
> >>  Hi,
> >>
> >>  Maybe your heap size is just too big, so your JVM spends too much time
> >>  in GC? The setup you described in your last e-mail is the "officially
> >>  supported" setup :-) Lucene has no problem with that setup and can index.
> >>  Be sure:
> >>  - Don't give too much heap to your indexing app. Larger heaps create
> >>  much more GC load.
> >>  - Use a suitable garbage collector (e.g. the Java 7 G1 collector or the
> >>  Java 6 CMS collector). Other garbage collectors may do GCs in a single
> >>  thread ("stop-the-world").
> >>
> >>  Uwe
> >>  -----
> >>  Uwe Schindler
> >>  H.-H.-Meier-Allee 63, D-28213 Bremen
> >>  http://www.thetaphi.de
> >>  eMail: uwe@thetaphi.de
> >>>  -----Original Message-----
> >>>  From: Igor Shalyminov [mailto:ishalyminov@yandex-team.ru]
> >>>  Sent: Saturday, November 23, 2013 4:46 PM
> >>>  To: java-user@lucene.apache.org
> >>>  Subject: Re: Lucene multithreaded indexing problems
> >>>
> >>>  So we return to the initially described setup: multiple parallel workers,
> >>>  each doing "parse + indexWriter.addDocument()" for single documents with
> >>>  no synchronization on my side. This setup also showed the bad memory
> >>>  consumption and thread blocking I reported.
> >>>
> >>>  Or did I misunderstand you?
> >>>
> >>>  --
> >>>  Igor
> >>>
> >>>  22.11.2013, 23:34, "Uwe Schindler" <uwe@thetaphi.de>:
> >>>>  Hi,
> >>>>  Don't use addDocuments. This method is made for so-called block
> >>>>  indexing (where all documents need to be in one block for block joins).
> >>>>  Call addDocument for each document, possibly from many threads. That way
> >>>>  Lucene can better handle multithreading and free memory early. There is
> >>>>  really no need to use bulk adds; they are solely for block joins, where
> >>>>  docs need to be sequential and without gaps.
> >>>>  Uwe
> >>>>
> >>>>  Igor Shalyminov <ishalyminov@yandex-team.ru> schrieb:
> >>>>>  - uwe@
> >>>>>
> >>>>>  Thanks Uwe!
> >>>>>
> >>>>>  I changed the logic so that my workers only parse input docs into
> >>>>>  Documents, and the indexWriter does addDocuments() by itself for
> >>>>>  chunks of 100 Documents.
> >>>>>  Unfortunately, the behaviour persists: memory usage slowly increases
> >>>>>  with the number of processed documents, and at some point the program
> >>>>>  runs very slowly, and it seems that only a single thread is active.
> >>>>>  This happens after lots of parse/index cycles.
> >>>>>
> >>>>>  The current instance is now in the "single-thread" phase with ~100%
> >>>>>  CPU and 8397M RES memory (the limit for the VM is -Xmx8G).
> >>>>>  My question is: when does addDocuments() release all resources passed
> >>>>>  in (the Documents themselves)?
> >>>>>  Are the resources released when the call returns, or do I have to
> >>>>>  call indexWriter.commit() after, say, each chunk?
> >>>>>
> >>>>>  --
> >>>>>  Igor
> >>>>>
> >>>>>  21.11.2013, 19:59, "Uwe Schindler" <uwe@thetaphi.de>:
> >>>>>>    Hi,
> >>>>>>
> >>>>>>    Why are you doing this? Lucene's IndexWriter can handle
> >>>>>>    addDocuments in multiple threads. And, since Lucene 4, it will
> >>>>>>    process them almost completely in parallel!
> >>>>>>    If you do the addDocuments single-threaded, you are adding an
> >>>>>>    additional bottleneck to your application. If you synchronize on
> >>>>>>    the IndexWriter (which I hope you will not do), things will go
> >>>>>>    wrong, too.
> >>>>>>    Uwe
> >>>>>>
> >>>>>>    -----
> >>>>>>    Uwe Schindler
> >>>>>>    H.-H.-Meier-Allee 63, D-28213 Bremen
> >>>>>>    http://www.thetaphi.de
> >>>>>>    eMail: uwe@thetaphi.de
> >>>>>>>     -----Original Message-----
> >>>>>>>     From: Igor Shalyminov [mailto:ishalyminov@yandex-team.ru]
> >>>>>>>     Sent: Thursday, November 21, 2013 4:45 PM
> >>>>>>>     To: java-user@lucene.apache.org
> >>>>>>>     Subject: Lucene multithreaded indexing problems
> >>>>>>>
> >>>>>>>     Hello!
> >>>>>>>
> >>>>>>>     I tried to perform indexing in multiple threads, with a
> >>>>>>>     FixedThreadPool of Callable workers.
> >>>>>>>     The main operation - parsing a single document and calling
> >>>>>>>     addDocument() on the index - is done by a single worker.
> >>>>>>>     After parsing a document, a lot (really a lot) of Strings appear,
> >>>>>>>     and at the end of the worker's call() all of them go to the
> >>>>>>>     indexWriter.
> >>>>>>>     I use no merging; the resources are flushed to disk when the
> >>>>>>>     segment size limit is reached.
> >>>>>>>
> >>>>>>>     The problem is, after a little while (when most of the heap
> >>>>>>>     memory is used) the indexer makes no progress, and CPU load is a
> >>>>>>>     constant 100% (no difference whether there are 2 threads or 32).
> >>>>>>>     So I think at some point garbage collection takes the whole
> >>>>>>>     indexing process down.
> >>>>>>>
> >>>>>>>     Could you please give some advice on proper concurrent indexing
> >>>>>>>     with Lucene?
> >>>>>>>     Can there be "memory leaks" somewhere in the indexWriter? Maybe
> >>>>>>>     I need to perform some operation on the writer from time to time
> >>>>>>>     to release unused resources?
> >>>>>>>
> >>>>>>>     --
> >>>>>>>     Best Regards,
> >>>>>>>     Igor
> >>>>  --
> >>>>  Uwe Schindler
> >>>>  H.-H.-Meier-Allee 63, 28213 Bremen
> >>>>  http://www.thetaphi.de


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

