lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Profiling lucene 5.2.0 based tool
Date Tue, 23 Feb 2016 09:40:50 GMT
Hi,

There is nothing you can improve in the single-threaded case. You can only parallelize to get
more out of it. Lucene is optimized for parallel processing while indexing, so you should
make use of that.
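
For illustration, feeding one shared IndexWriter from several worker threads can look roughly
like the sketch below. The index path and the readNextRecord()/toDocument() helpers are
placeholders standing in for your CSV pipeline, not Lucene APIs:

import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ParallelIndexing {
  public static void main(String[] args) throws Exception {
    // One IndexWriter shared by all indexing threads; IndexWriter is
    // thread-safe, so the threads can call addDocument() concurrently.
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/path/to/index")),        // placeholder path
        new IndexWriterConfig(new StandardAnalyzer()));

    ExecutorService pool = Executors.newFixedThreadPool(4);   // e.g. 4 indexing threads
    for (int i = 0; i < 4; i++) {
      pool.submit(() -> {
        String record;
        while ((record = readNextRecord()) != null) {   // placeholder: thread-safe CSV reader
          Document doc = toDocument(record);            // placeholder: builds the 46-field document
          writer.addDocument(doc);
        }
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    writer.close();
  }

  // Placeholders standing in for the poster's CSV pipeline.
  static String readNextRecord() { return null; }
  static Document toDocument(String record) { return new Document(); }
}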

> > > Note: After going through the code I found out that addDocument
> > > internally just calls updateDocument. Is there any way by which we can
> > > avoid calling updateDocument and use only the addDocument API?

Updating a document means deleting the old one and indexing a new one, so both operations
share the same internal logic, and it is perfectly fine for addDocument to delegate internally.
The only difference is that addDocument does not delete a previous document: it passes a null
term, which tells updateDocument not to delete anything. So there is nothing to improve.
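
For reference, the delegation is essentially just this one-liner (paraphrased, not copied
verbatim from the 5.2.0 source), so there is no overhead worth avoiding:

// Paraphrase of IndexWriter's internal delegation: addDocument passes a
// null term, so updateDocument has no previous document to delete.
public void addDocument(Iterable<? extends IndexableField> doc) throws IOException {
  updateDocument(null, doc);   // null Term => nothing is deleted
}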

Uwe
 
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: sandeep das [mailto:yarnhadoop@gmail.com]
> Sent: Tuesday, February 23, 2016 8:30 AM
> To: java-user@lucene.apache.org
> Subject: Re: Profiling lucene 5.2.0 based tool
> 
> Hi Rob,
> 
> The statistics I shared were measured with a single thread doing the
> indexing. I wish to use only one thread and still process a data rate of up
> to 10 MBps (megabytes per second). I believe that should be achievable with
> a single thread.
> 
> Regards,
> Sandeep
> 
> On Tue, Feb 23, 2016 at 12:50 PM, Rob Audenaerde
> <rob.audenaerde@gmail.com>
> wrote:
> 
> > Hi Sandeep,
> >
> > How many threads do you use to do the indexing? The benchmarks of Lucene
> > are done with >20 threads, IIRC.
> >
> > -Rob
> >
> > On Tue, Feb 23, 2016 at 8:01 AM, sandeep das <yarnhadoop@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > I've implemented a tool using lucene-5.2.0 to index my CSV files. The
> > > tool reads data from CSV files residing on disk and creates indexes on
> > > the local disk. It is able to process 3.5 MBps of data. Overall, 46
> > > fields are added to each document, and they have only three data types:
> > > 1. Integer, 2. Long, 3. String.
> > > All these fields are part of one CSV record, and they are parsed using a
> > > custom CSV parser that is faster than String's split method.
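
[A regex-free splitter of the kind described above could look roughly like the sketch below;
this is an illustration, not the poster's actual parser.]

import java.util.ArrayList;
import java.util.List;

public class CsvSplit {
  // Split one CSV line on ',' with indexOf/substring; no regular expressions,
  // no quoting/escaping support (illustration only).
  static List<String> splitLine(String line) {
    List<String> fields = new ArrayList<>(46);   // 46 fields per record, per the post
    int start = 0;
    int comma;
    while ((comma = line.indexOf(',', start)) >= 0) {
      fields.add(line.substring(start, comma));
      start = comma + 1;
    }
    fields.add(line.substring(start));           // trailing field
    return fields;
  }
}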
> > >
> > > I've configured the following parameters to create the IndexWriter:
> > > 1. setOpenMode(OpenMode.CREATE)
> > > 2. setCommitOnClose(true)
> > > 3. setRAMBufferSizeMB(512)   // I tried 256 and 312 as well, but performance
> > > is almost the same.
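
[For illustration, those three settings map onto IndexWriterConfig roughly as follows; the
analyzer and index path are placeholders, and imports are as in the threading sketch above:]

// Sketch of the configuration described above.
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);  // always build a fresh index
config.setCommitOnClose(true);                          // commit pending changes when close() is called
config.setRAMBufferSizeMB(512);                         // flush once ~512 MB of documents are buffered
IndexWriter writer =
    new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), config);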
> > >
> > > I've read on several blogs that Lucene works way faster than these
> > > figures, so I thought there were some bottlenecks in my code and profiled
> > > it using jvisualvm. The application is spending most of its time in
> > > DefaultIndexChain.processField, i.e. 53% of the total time.
> > >
> > >
> > > Following is the split of CPU usage in this application:
> > > 1. Reading data from disk takes 5% of the total duration.
> > > 2. Adding documents takes 93% of the total duration.
> > >
> > >    -    postUpdate  -> 12.8%
> > >    -    doAfterDocument -> 20.6%
> > >    -    updateDocument  -> 59.8%
> > >       - finishDocument -> 1.7%
> > >       - finishStoreFields -> 4.8%
> > >       - processFields -> 53.1%
> > >
> > >
> > > I'm also attaching a screenshot of the call graph generated by jvisualvm.
> > >
> > > I've taken care of the following points:
> > > 1. Create only one instance of IndexWriter.
> > > 2. Create only one instance of Document and reuse it throughout the
> > > lifetime of the application.
> > > 3. There will be no updates to documents, hence only addDocument is
> > > invoked.
> > > Note: After going through the code I found out that addDocument
> > > internally just calls updateDocument. Is there any way by which we can
> > > avoid calling updateDocument and use only the addDocument API?
> > > 4. Use the setValue APIs to set the pre-created fields and reuse these
> > > fields to create indexes.
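
[Points 2 and 4 as a sketch, with placeholder field names and an assumed CsvRecord type; only
the values change between records:]

// Build the Document and its Fields once (field names are placeholders).
Document doc = new Document();
IntField count = new IntField("count", 0, Field.Store.YES);
LongField timestamp = new LongField("timestamp", 0L, Field.Store.YES);
StringField host = new StringField("host", "", Field.Store.YES);
doc.add(count);
doc.add(timestamp);
doc.add(host);

// Per record, only the field values are updated before addDocument().
for (CsvRecord rec : records) {            // CsvRecord and records are assumed
  count.setIntValue(rec.count);
  timestamp.setLongValue(rec.timestamp);
  host.setStringValue(rec.host);
  writer.addDocument(doc);                 // the same Document instance every time
}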
> > >
> > > Any tips to improve the performance would be immensely appreciated.
> > >
> > > Regards,
> > > Sandeep
> > >
> > >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

