lucene-java-user mailing list archives

From sandeep das <yarnhad...@gmail.com>
Subject Re: Profiling lucene 5.2.0 based tool
Date Tue, 23 Feb 2016 11:02:22 GMT
Thanks a lot, guys. I really appreciate your responses to my query. I'll
create multiple threads and check how much the indexing rate can be
increased per thread.
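For what it's worth, a minimal sketch of that split, with a round-robin
distribution of lines across workers (my assumption, not code from the tool)
and an AtomicLong standing in for the shared sink. With Lucene the sink would
be the single IndexWriter, which is thread-safe and meant to be shared, but
each thread must then reuse its OWN Document/Field instances, never share them:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelIndexSketch {
    // Feed lines to `threads` workers; returns how many "documents" were indexed.
    static long indexAll(List<String> lines, int threads) throws InterruptedException {
        AtomicLong indexed = new AtomicLong();   // stand-in for writer.addDocument(doc)
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int offset = t;
            pool.submit(() -> {
                // Each worker takes every threads-th line, so every line
                // is parsed and indexed exactly once.
                for (int i = offset; i < lines.size(); i += threads) {
                    String[] fields = lines.get(i).split(","); // parse the CSV record
                    indexed.incrementAndGet();                 // writer.addDocument(doc)
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return indexed.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 1000; i++) lines.add("row," + i);
        System.out.println("indexed " + indexAll(lines, 4) + " docs");
    }
}
```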


Regards,
Sandeep

On Tue, Feb 23, 2016 at 4:19 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Your profiler breakdown is exactly what I'd expect: processing the
> fields is the heaviest part of indexing.
>
> Except, it doesn't have any merges?  Did you run it for long enough?
> Note that by default Lucene runs merges in a background thread
> (ConcurrentMergeScheduler).  If you really must be single-threaded
> (why?) then you should use SerialMergeScheduler instead.
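A minimal sketch of that switch, assuming an FSDirectory index and
StandardAnalyzer (both placeholders; use whatever the tool already has):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.FSDirectory;

public class SerialMergeExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        // Run merges on the indexing thread instead of the default
        // ConcurrentMergeScheduler's background threads:
        iwc.setMergeScheduler(new SerialMergeScheduler());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/index")), iwc)) { // placeholder path
            // ... addDocument calls here; merges now happen inline,
            // so they will show up in a single-threaded profile.
        }
    }
}
```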
>
> The doAfterDocument is likely the flush time (writing the new segment
> once the in-heap indexing buffer is full).
>
> Finally, if many of your fields are numeric, 6.0 offers some nice
> improvements here with the new dimensional points feature.  See
> https://www.elastic.co/blog/lucene-points-6.0 ... but note that 6.0 is
> not yet released, though it should be soon.
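For reference, the points API on the 6.0 branch looks roughly like this
(pre-release, so details may still change; the field names are made up):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.Query;

public class PointsSketch {
    static Document makeDoc(int duration, long timestamp) {
        Document doc = new Document();
        // Numeric values indexed as dimensional points rather than
        // the older precision-step trie numeric fields:
        doc.add(new IntPoint("duration", duration));
        doc.add(new LongPoint("timestamp", timestamp));
        return doc;
    }

    static Query durationBetween(int lo, int hi) {
        // Efficient numeric range query over the point field:
        return IntPoint.newRangeQuery("duration", lo, hi);
    }
}
```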
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Feb 23, 2016 at 2:01 AM, sandeep das <yarnhadoop@gmail.com> wrote:
> > Hi,
> >
> > I've implemented a tool using lucene-5.2.0 to index my CSV files. The
> > tool reads data from CSV files (residing on disk) and creates indexes
> > on local disk. It is able to process 3.5 MBps of data. Overall, 46
> > fields are added to one document, and they are of only three data
> > types: 1. Integer, 2. Long, 3. String.
> > All these fields are part of one CSV record, and they are parsed using
> > a custom CSV parser, which is faster than String.split.
> >
> > I've configured the following parameters to create the IndexWriter:
> > 1. setOpenMode(OpenMode.CREATE)
> > 2. setCommitOnClose(true)
> > 3. setRAMBufferSizeMB(512)   // Tried 256 and 312 as well, but
> > performance is almost the same.
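For later readers, those three settings together look roughly like this; the
analyzer and index path are placeholders, not from the original tool:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;

public class WriterConfigSketch {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setOpenMode(OpenMode.CREATE);  // overwrite any existing index
        iwc.setCommitOnClose(true);        // commit pending changes on close()
        iwc.setRAMBufferSizeMB(512);       // flush a new segment at ~512 MB of buffered docs
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/index")), iwc); // placeholder path
        writer.close();
    }
}
```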
> >
> > I've read on several blogs that Lucene works way faster than these
> > figures, so I thought there were some bottlenecks in my code and
> > profiled it using jvisualvm. The application is spending most of its
> > time in DefaultIndexChain.processField, i.e. 53% of total time.
> >
> >
> > Following is the split of CPU usage in this application:
> > 1. Reading data from disk takes 5% of the total duration.
> > 2. Adding documents takes 93% of the total duration:
> >
> >    postUpdate      -> 12.8%
> >    doAfterDocument -> 20.6%
> >    updateDocument  -> 59.8%
> >      finishDocument    ->  1.7%
> >      finishStoreFields ->  4.8%
> >      processFields     -> 53.1%
> >
> >
> > I'm also attaching the screenshot of the call graph generated by jvisualvm.
> >
> > I've taken care of the following points:
> > 1. Create only one instance of IndexWriter.
> > 2. Create only one instance of Document and reuse it throughout the
> > lifetime of the application.
> > 3. There will be no updates to the documents, hence only addDocument is
> > invoked.
> > Note: After going through the code I found that addDocument internally
> > calls updateDocument. Is there any way we can avoid calling
> > updateDocument and only use the addDocument API?
> > 4. Use the setValue APIs to set the pre-created fields and reuse these
> > fields to create indexes.
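Points 2 and 4 together amount to a reuse pattern roughly like the following,
using the Lucene 5.x field classes (field names and types here are
illustrative, not the tool's actual 46-field schema):

```java
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

public class FieldReuseSketch {
    static void indexRecords(IndexWriter writer, List<String[]> records)
            throws Exception {
        // Create the Document and its Fields once...
        Document doc = new Document();
        IntField count = new IntField("count", 0, Field.Store.NO);
        LongField ts = new LongField("timestamp", 0L, Field.Store.NO);
        StringField host = new StringField("host", "", Field.Store.YES);
        doc.add(count);
        doc.add(ts);
        doc.add(host);

        // ...then, per CSV record, only swap in new values and re-add,
        // avoiding per-document Field allocation and garbage:
        for (String[] rec : records) {
            count.setIntValue(Integer.parseInt(rec[0]));
            ts.setLongValue(Long.parseLong(rec[1]));
            host.setStringValue(rec[2]);
            writer.addDocument(doc);
        }
    }
}
```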
> >
> > Any tip to improve the performance will be immensely appreciated.
> >
> > Regards,
> > Sandeep
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
>
