lucene-java-user mailing list archives

From sandeep das <yarnhad...@gmail.com>
Subject Re: Profiling lucene 5.2.0 based tool
Date Tue, 23 Feb 2016 07:29:38 GMT
Hi Rob,

The statistics I shared were gathered using a single indexing thread. I want
to use only one thread and process data at up to 10 MBps (megabytes per
second). I believe that should be achievable with a single thread.
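
As a rough sanity check on that target, the figures from the quoted message below (3.5 MBps measured, 53.1% of time in processFields) can be put through Amdahl's law. This is only a sketch: the class and method names are made up for illustration, and it assumes the sampled profile percentages translate directly into wall-clock shares, which a sampling profiler does not guarantee.

```java
public class ThroughputEstimate {

    /** Speedup factor needed to move from the measured rate to the target rate. */
    public static double requiredSpeedup(double measuredMBps, double targetMBps) {
        return targetMBps / measuredMBps;
    }

    /**
     * Amdahl's law: overall speedup when `fraction` of the runtime is made
     * `factor` times faster and the rest is left unchanged.
     */
    public static double overallSpeedup(double fraction, double factor) {
        return 1.0 / ((1.0 - fraction) + fraction / factor);
    }

    public static void main(String[] args) {
        double measuredMBps = 3.5;         // rate reported in the thread
        double targetMBps = 10.0;          // desired single-thread rate
        double processFieldsShare = 0.531; // share reported by the profiler

        double needed = requiredSpeedup(measuredMBps, targetMBps);

        // Upper bound: pretend processFields could be made infinitely fast.
        double bound = overallSpeedup(processFieldsShare, Double.POSITIVE_INFINITY);

        System.out.printf("speedup needed: %.2fx%n", needed);
        System.out.printf("best case from processFields alone: %.2fx (%.2f MBps)%n",
                bound, measuredMBps * bound);
    }
}
```

Under these assumptions the target needs roughly a 2.9x speedup, while removing the processFields cost entirely would give only about 2.1x (around 7.5 MBps), so reaching 10 MBps on one thread would also require savings in the other parts of the add-document path.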

Regards,
Sandeep

On Tue, Feb 23, 2016 at 12:50 PM, Rob Audenaerde <rob.audenaerde@gmail.com>
wrote:

> Hi Sandeep,
>
> How many threads do you use to do the indexing? The benchmarks of Lucene
> are done on >20 threads IIRC.
>
> -Rob
>
> On Tue, Feb 23, 2016 at 8:01 AM, sandeep das <yarnhadoop@gmail.com> wrote:
>
> > Hi,
> >
> > I've implemented a tool using lucene-5.2.0 to index my CSV files. The
> > tool reads data from CSV files (residing on disk) and creates indexes on
> > local disk. It is able to process 3.5 MBps of data. Each document has 46
> > fields in total, of only three data types: 1. Integer, 2. Long,
> > 3. String.
> > All these fields are part of one CSV record, and they are parsed using a
> > custom CSV parser that is faster than any String split method.
> >
> > I've configured the following parameters to create the IndexWriter:
> > 1. setOpenMode(OpenMode.CREATE)
> > 2. setCommitOnClose(true)
> > 3. setRAMBufferSizeMB(512)   // Tried 256 and 312 as well, but performance
> > is almost the same.
> >
> > I've read on several blogs that Lucene works much faster than these
> > figures, so I suspected a bottleneck in my code and profiled it using
> > jvisualvm. The application spends most of its time in
> > DefaultIndexChain.processField, i.e. 53% of total time.
> >
> >
> > Following is the split of CPU usage in this application:
> > 1. Reading data from disk takes 5% of the total duration.
> > 2. Adding documents takes 93% of the total duration.
> >
> >    -    postUpdate  -> 12.8%
> >    -    doAfterDocument -> 20.6%
> >    -    updateDocument  -> 59.8%
> >       - finishDocument -> 1.7%
> >       - finishStoreFields -> 4.8%
> >       - processFields -> 53.1%
> >
> >
> > I'm also attaching a screenshot of the call graph generated by jvisualvm.
> >
> > I've taken care of the following points:
> > 1. Create only one instance of IndexWriter.
> > 2. Create only one instance of Document and reuse it throughout the
> > lifetime of the application.
> > 3. There will be no updates to the documents, hence only addDocument is
> > invoked.
> > Note: After going through the code I found that addDocument internally
> > calls updateDocument. Is there any way to avoid calling updateDocument
> > and use only the addDocument API?
> > 4. Use the setValue APIs to set the pre-created fields, and reuse these
> > fields to create the indexes.
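
For readers following the thread, the writer settings and the reuse pattern described above might look roughly like the sketch below against the Lucene 5.x API. The field names, the index path, the sample values, the choice of StandardAnalyzer, and Field.Store.NO are all assumptions made for illustration; the original tool's code was not posted, and only three of its 46 fields are mimicked here.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CsvIndexerSketch {
    public static void main(String[] args) throws IOException {
        // The three settings listed in the message; the analyzer is an assumption.
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(OpenMode.CREATE);
        config.setCommitOnClose(true);
        config.setRAMBufferSizeMB(512);

        // "index-dir" is a hypothetical output path.
        try (Directory dir = FSDirectory.open(Paths.get("index-dir"));
             IndexWriter writer = new IndexWriter(dir, config)) {

            // One Document and one Field instance per column, created once.
            // Field names are hypothetical.
            Document doc = new Document();
            IntField port = new IntField("port", 0, Field.Store.NO);
            LongField timestamp = new LongField("timestamp", 0L, Field.Store.NO);
            StringField host = new StringField("host", "", Field.Store.NO);
            doc.add(port);
            doc.add(timestamp);
            doc.add(host);

            // In the real tool this body would run once per parsed CSV record;
            // the values here are placeholders.
            port.setIntValue(8080);
            timestamp.setLongValue(1456212578000L);
            host.setStringValue("example-host");
            writer.addDocument(doc);
        }
    }
}
```

The sketch needs lucene-core and lucene-analyzers-common 5.x on the classpath. It only illustrates the setup described in the message and is not expected to change the measured throughput by itself.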
> >
> > Any tip to improve the performance will be immensely appreciated.
> >
> > Regards,
> > Sandeep
