lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Profiling lucene 5.2.0 based tool
Date Tue, 23 Feb 2016 10:49:34 GMT
Your profiler breakdown is exactly what I'd expect: processing the
fields is the heaviest part of indexing.

Except, it doesn't have any merges?  Did you run it for long enough?
Note that by default Lucene runs merges in a background thread
(ConcurrentMergeScheduler).  If you really must be single-threaded
(why?) then you should use SerialMergeScheduler instead.
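
If you do want to stay single-threaded, the switch is a one-liner on the
IndexWriterConfig (just a sketch; "analyzer" stands for whatever analyzer
you already use):

    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.SerialMergeScheduler;

    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    // Run merges inline on the indexing thread instead of in background threads
    iwc.setMergeScheduler(new SerialMergeScheduler());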

The doAfterDocument is likely the flush time (writing the new segment
once the in-heap indexing buffer is full).
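
One way to confirm that is to turn on IndexWriter's infoStream, which logs
flushes and merges (sketch, on the same IndexWriterConfig as above; it is
verbose, so only enable it while debugging):

    // Prints flush/merge/segment diagnostics to stdout
    iwc.setInfoStream(System.out);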

Finally, if many of your fields are numeric, 6.0 offers some nice
improvements here with the new dimensional points feature.  See
https://www.elastic.co/blog/lucene-points-6.0 ... but note that 6.0 is not
yet released, though it should be soon.
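
For example, a numeric field indexed as a point in 6.0 looks roughly like
this (the field names here are just made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.document.LongPoint;

    Document doc = new Document();
    // Points are indexed for fast exact/range queries but are not stored;
    // add a separate stored field if you also need to retrieve the value.
    doc.add(new IntPoint("statusCode", 200));
    doc.add(new LongPoint("timestampMillis", 1456221574000L));

Range queries then go through IntPoint.newRangeQuery / LongPoint.newRangeQuery.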

Mike McCandless

http://blog.mikemccandless.com


On Tue, Feb 23, 2016 at 2:01 AM, sandeep das <yarnhadoop@gmail.com> wrote:
> Hi,
>
> I've implemented a tool using lucene-5.2.0 to index my CSV files. The tool
> reads data from CSV files (residing on disk) and creates indexes on the
> local disk. It is able to process 3.5 MB/s of data. Overall, 46 fields are
> added to each document, and they are of only three data types: 1. Integer,
> 2. Long, 3. String.
> All these fields come from one CSV record and are parsed using a custom
> CSV parser, which is faster than String's split method.
>
> I've configured the following parameters to create the IndexWriter (see the
> sketch after this list):
> 1. setOpenMode(OpenMode.CREATE)
> 2. setCommitOnClose(true)
> 3. setRAMBufferSizeMB(512)   // Tried 256 and 312 as well, but performance is
> almost the same.
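>
> A rough sketch of the writer setup (analyzer choice and index path are
> simplified here):
>
>     import java.nio.file.Paths;
>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>     import org.apache.lucene.index.IndexWriter;
>     import org.apache.lucene.index.IndexWriterConfig;
>     import org.apache.lucene.index.IndexWriterConfig.OpenMode;
>     import org.apache.lucene.store.FSDirectory;
>
>     IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
>     iwc.setOpenMode(OpenMode.CREATE);   // always build a fresh index
>     iwc.setCommitOnClose(true);         // commit pending changes on close
>     iwc.setRAMBufferSizeMB(512);        // flush once the in-heap buffer reaches 512 MB
>     IndexWriter writer =
>         new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), iwc);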
>
> I've read in several blogs that Lucene can index much faster than these
> figures, so I suspected there were bottlenecks in my code and profiled it
> using jvisualvm. The application spends most of its time in
> DefaultIndexChain.processField, i.e. 53% of the total time.
>
>
> Following is the split of CPU usage in this application:
> 1. Reading data from disk takes 5% of the total duration.
> 2. Adding documents takes 93% of the total duration, split as:
>
>        postUpdate      -> 12.8%
>        doAfterDocument -> 20.6%
>        updateDocument  -> 59.8%, of which:
>            finishDocument    ->  1.7%
>            finishStoreFields ->  4.8%
>            processFields     -> 53.1%
>
>
> I'm also attaching a screenshot of the call graph generated by jvisualvm.
>
> I've taken care of the following points:
> 1. Create only one instance of IndexWriter.
> 2. Create only one instance of Document and reuse it throughout the lifetime
> of the application.
> 3. There are no updates to the documents, hence only addDocument is invoked.
> Note: after going through the code I found that addDocument internally
> calls updateDocument. Is there any way to avoid calling updateDocument and
> use only the addDocument API?
> 4. Use the setValue APIs on pre-created fields and reuse those fields to
> build the index (see the sketch below).
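>
> The reuse pattern looks roughly like this (field names and the "record"
> object are simplified examples):
>
>     import org.apache.lucene.document.Document;
>     import org.apache.lucene.document.Field;
>     import org.apache.lucene.document.IntField;
>     import org.apache.lucene.document.LongField;
>     import org.apache.lucene.document.StringField;
>
>     // Created once and reused for every CSV record
>     Document doc = new Document();
>     IntField statusField = new IntField("status", 0, Field.Store.NO);
>     LongField timeField = new LongField("timestamp", 0L, Field.Store.NO);
>     StringField userField = new StringField("user", "", Field.Store.NO);
>     doc.add(statusField);
>     doc.add(timeField);
>     doc.add(userField);
>
>     // Per record: only the values change, then the same document is added
>     statusField.setIntValue(record.getStatus());
>     timeField.setLongValue(record.getTimestamp());
>     userField.setStringValue(record.getUser());
>     writer.addDocument(doc);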
>
> Any tip to improve the performance will be immensely appreciated.
>
> Regards,
> Sandeep
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

