lucene-java-user mailing list archives

From sandeep das <>
Subject Profiling lucene 5.2.0 based tool
Date Tue, 23 Feb 2016 07:01:20 GMT

I've implemented a tool using lucene-5.2.0 to index my CSV files. The tool
reads data from CSV files (residing on disk) and creates indexes on the
local disk. It is able to process 3.5 MB/s of data. Each document has 46
fields in total, of only three data types: 1. Integer, 2. Long, 3. String.
All these fields come from one CSV record and are parsed using a custom
CSV parser, which is faster than any String split method.
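
A minimal sketch of index-based splitting of this kind, not the actual parser from the tool: the class name CsvLineSplitter is hypothetical, and the sketch assumes comma-separated records with no quoted or escaped fields. It has not been benchmarked against String.split here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter for illustration; not the parser used in the original tool. */
final class CsvLineSplitter {

    /** Splits one unquoted CSV record on ',' without regular expressions, keeping empty fields. */
    static List<String> split(String line) {
        // 46 is the per-record field count mentioned in this post.
        List<String> fields = new ArrayList<>(46);
        int start = 0;
        int comma;
        while ((comma = line.indexOf(',', start)) >= 0) {
            fields.add(line.substring(start, comma));
            start = comma + 1;
        }
        // Whatever follows the last comma is the final field (possibly empty).
        fields.add(line.substring(start));
        return fields;
    }
}
```

Calling CsvLineSplitter.split(line) once per record returns the raw field strings; converting them to int or long values would happen afterwards.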

I've configured the following parameters to create the IndexWriter:
1. setOpenMode(OpenMode.CREATE)
2. setCommitOnClose(true)
3. setRAMBufferSizeMB(512)   // Tried 256 and 312 as well, but performance is
almost the same.
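
For reference, the three settings above might be wired up roughly as in the sketch below, written against the Lucene 5.x API. The class name WriterSetup, the indexPath parameter, and the choice of StandardAnalyzer are assumptions; the post does not say which analyzer or directory implementation the tool uses.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Hypothetical helper; only the three setter calls come from the post. */
final class WriterSetup {

    static IndexWriter openWriter(String indexPath) throws IOException {
        Directory dir = FSDirectory.open(Paths.get(indexPath));

        // Assumption: StandardAnalyzer. The original analyzer is not stated.
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setOpenMode(OpenMode.CREATE);   // always build a fresh index
        cfg.setCommitOnClose(true);         // commit pending changes in close()
        cfg.setRAMBufferSizeMB(512);        // flush a segment once ~512 MB is buffered

        return new IndexWriter(dir, cfg);
    }
}
```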

I've read on several blogs that Lucene works much faster than these
figures, so I suspected there were bottlenecks in my code and profiled
it using jvisualvm. The application spends most of its time in
DefaultIndexingChain.processField, i.e. 53% of total time.

Following is the breakdown of CPU usage in this application:
1. Reading data from disk takes 5% of the total duration.
2. Adding documents takes 93% of the total duration:

   -    postUpdate  -> 12.8%
   -    doAfterDocument -> 20.6%
   -    updateDocument  -> 59.8%
      - finishDocument -> 1.7%
      - finishStoreFields -> 4.8%
      - processFields -> 53.1%

I'm also attaching a screenshot of the call graph generated by jvisualvm.

I've taken care of the following points:
1. Create only one instance of IndexWriter.
2. Create only one instance of Document and reuse it throughout the
lifetime of the application.
3. There will be no updates to the documents, hence only addDocument is
called.
Note: After going through the code I found out that addDocument
internally calls updateDocument. Is there any way to avoid calling
updateDocument and use only the addDocument API?
4. Use the setValue APIs to set the pre-created fields, and reuse these
fields to create indexes.
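
The reuse pattern in points 2 to 4 can be sketched as follows against the Lucene 5.x field classes. The class name ReusableRecord, the three field names, and the use of Field.Store.NO are made up for illustration; the real tool has 46 fields whose names and storage options are not given here.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

/** Hypothetical holder for one reusable Document and its pre-created fields. */
final class ReusableRecord {
    private final Document doc = new Document();

    // Assumption: illustrative field names, values not stored.
    private final IntField port = new IntField("port", 0, Field.Store.NO);
    private final LongField timestamp = new LongField("timestamp", 0L, Field.Store.NO);
    private final StringField host = new StringField("host", "", Field.Store.NO);

    ReusableRecord() {
        // Fields are attached to the document once and then only mutated.
        doc.add(port);
        doc.add(timestamp);
        doc.add(host);
    }

    /** Overwrites the field values for one CSV record and indexes the shared document. */
    void index(IndexWriter writer, int portValue, long tsValue, String hostValue)
            throws IOException {
        port.setIntValue(portValue);
        timestamp.setLongValue(tsValue);
        host.setStringValue(hostValue);
        writer.addDocument(doc);
    }
}
```

This only works from a single indexing thread, since the same Document and Field instances are mutated for every record; with several threads, each thread would need its own instance.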

Any tip to improve the performance will be immensely appreciated.

