lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Comparing Indexing Speed of Lucene 3.5 and 4.0
Date Sat, 07 Jan 2012 13:03:08 GMT
Hi,

> >  I mean my benchmarks show up
> > to 300% improvement with 4.x versus older versions so something is
> > weird, i.e. non-realistic, here or there is a bug, so let's figure this
> > out. Can you profile your app and see if you find something suspicious?
> > I'll try now and report back.
> 
> It seems to be largely my mistake: Maven enables assertions automatically
> when running tests.
> Executing it as a normal class with a public main method results in faster
> indexing times for 4.0 compared to 3.5.
> 
> Conclusion:
> 1. execution with assertions for 4.0 is slower than 3.5 (that's what I mainly
> measured :/)

Die, Maven, die :-)
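
To guard a benchmark against this, it can check at startup whether the JVM
runs it with assertions enabled. A minimal sketch, assuming a hypothetical
helper class AssertionCheck (it is not part of Lucene or Maven); it relies
only on the standard Java idiom of an assert statement with a side effect:

```java
// Hypothetical helper for illustration; not part of Lucene or Maven.
final class AssertionCheck {
    private AssertionCheck() {}

    /** Returns true if this class runs with assertions enabled (e.g. java -ea). */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment inside the assert only executes when assertions are on.
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        if (assertionsEnabled()) {
            System.err.println("WARNING: assertions are enabled; timings are not representative.");
        }
        // ... run the indexing benchmark here ...
    }
}
```

Calling assertionsEnabled() before the timed section makes it obvious when
the numbers come from a test runner that passes -ea.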

> 2. luc 4.0 execution times vary more than 3.5 when using reopen thread (and
> one single indexing thread, others not tested).
> 3. luc 4.0 then is still slower, but for 5 million of my items it's less than 5%.
>  The hot spots are:
>  * 30% ThreadAffinityDocumentsWriterThreadPool ->
> java.util.concurrent.ConcurrentHashMap.get(Object) -> threadBindings.get
>  * 26% BufferedDeletesStream.applyTermDeletes(Iterable, SegmentReader)
>  * 16% FreqProxTermsWriterPerField.flush(String, FieldsConsumer,
> SegmentWriteState)
>  * 10% DocFieldProcessor.processDocument
> 
> Now when reusing BytesRef in 4.0 (and reusing the char array in 3.5) then luc 4
> is >20% faster than 3.5 for 5 million docs!

You can only reuse the BytesRef (I assume the one to encode the numeric key to delete the
document) from within the same thread! I see no other BytesRef use in your code. If you reuse
the BytesRef, you can also reuse all Fields and Documents - but only within the same thread.

> But at some point I had problems because a thread concurrently modified the
> docs - can this happen e.g. from the reopen thread? Or is it safe to reuse BytesRef?

In one thread: yes!
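
One way to keep such reuse confined to a single thread is a ThreadLocal
scratch buffer. This is only a sketch: PerThreadKey is a hypothetical
stand-in, it uses a plain byte[] instead of Lucene's BytesRef, and the 8-byte
big-endian encoding is an illustrative assumption, not Lucene's numeric term
encoding:

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a reusable key buffer; in Lucene 4.0 the
// reused object would be a BytesRef (plus Fields and Documents).
final class PerThreadKey {
    private PerThreadKey() {}

    // One 8-byte scratch buffer per thread; it is never shared across threads.
    private static final ThreadLocal<byte[]> SCRATCH =
            ThreadLocal.withInitial(() -> new byte[Long.BYTES]);

    /** Encodes id into the calling thread's buffer and returns that buffer. */
    static byte[] encode(long id) {
        byte[] buf = SCRATCH.get();
        // Overwrites the previous contents with the big-endian bytes of id.
        ByteBuffer.wrap(buf).putLong(id);
        return buf;
    }
}
```

Each call overwrites the previous contents, so the returned array must be
consumed before the same thread calls encode() again.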

Uwe

> Regards,
> Peter.
> 
> 
> 
> 
> > Hi Simon,
> >
> > answers below.
> >
> >>> It does not seem to be an 'IO related issue' because using RAMDirectory
> >>> results in the same times.
> >>> And indexing via Luc4 with only one thread shouldn't be slower than 3.5 (?)
> >> it could be since we use a different term dictionary impl which is
> >> more expensive in building than the previous versions; that's just a
> >> guess.
> >> What I am really wondering is why you are using the NRT manager and
> >> reopen during indexing - are you measuring the NRT reopen times too?
> > My project requires reopening as it will then clear some caches.
> >
> > Reopening isn't that frequent (every 5 seconds). When disabling it the
> > difference even increases slightly, but the big variation for luc4 goes
> > away!
> >
> >
> >> What merge policies are you using for 3x and 4x?
> > The default ones. I'm now using LogByteSizeMergePolicy for both but it
> > is nearly the same difference.
> >
> >
> >>>> You should add some more randomness or reality to your test.
> >>> Hmmh, ok. The uid and type are the reality in my other (experimental)
> >>> project as it uses a generated and incremented id from AtomicLong and
> >>> two types.
> >>> Or do you have an explanation why luc4 can be slower on such 'simple'
> >>> fields?
> >> you reported that indexing only the ID is faster in 4.x but the other
> >> fields AFAIK are likely always the same for all docs, no?
> > no, the _uid field is different: it's the id field converted to string.
> >
> >
> >> you are indexing with one thread right?
> > yes.
> >
> >
> >>  I mean my benchmarks show up
> >> to 300% improvement with 4.x versus older versions so something is
> >> weird, i.e. non-realistic, here or there is a bug, so let's figure this
> >> out. Can you profile your app and see if you find something suspicious?
> > I'll try now and report back.
> >
> >
> >> I'd also try to index way more documents to make your benchmarks run
> >> little longer just to be sure.
> > For ~5 times more docs (5 million) it is nearly the same difference.
> >
> >
> > Regards,
> > Peter.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



