lucene-java-user mailing list archives

From Peter K <peat...@yahoo.de>
Subject Re: Comparing Indexing Speed of Lucene 3.5 and 4.0
Date Sat, 07 Jan 2012 12:48:45 GMT
>  I mean my benchmarks show up
> to 300% improvement with 4.x versus older versions so something is
> weird ie. non-realistic here or there is a bug so lets figure this
> out. Can you profile you app and see if you find something suspicious?
> I'll try now and report back.

It seems to be largely my mistake: Maven enables assertions automatically when running tests.
Running the benchmark as a normal public main class results in faster indexing times for 4.0
than for 3.5.
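
A benchmark can guard against this by checking at startup whether assertions are on. The following is a minimal sketch, not code from my project; the class name AssertionCheck is made up for illustration. It relies on the fact that the assignment inside an assert statement is only executed when the JVM runs with assertions enabled (-ea).

```java
public class AssertionCheck {

    /** Returns true only if the JVM runs this class with assertions enabled (-ea). */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment is a side effect that only happens when assertions are on.
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        if (assertionsEnabled()) {
            System.err.println("WARNING: assertions are enabled, timings are not representative");
        } else {
            System.out.println("Assertions are disabled, ok to benchmark");
        }
    }
}
```

As far as I know, the Surefire plugin also has an enableAssertions setting that can be switched to false when a benchmark has to run as a test.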

Conclusion:
1. Execution with assertions enabled is slower for 4.0 than for 3.5 (that's what I mainly measured :/)
2. Lucene 4.0 execution times vary more than 3.5 when using a reopen thread (with one single indexing
thread; other configurations not tested).
3. Lucene 4.0 is then still slower, but for 5 million of my items it's less than 5%.
 The hot spots are:
 * 30% ThreadAffinityDocumentsWriterThreadPool -> java.util.concurrent.ConcurrentHashMap.get(Object) -> threadBindings.get
 * 26% BufferedDeletesStream.applyTermDeletes(Iterable, SegmentReader)
 * 16% FreqProxTermsWriterPerField.flush(String, FieldsConsumer, SegmentWriteState)
 * 10% DocFieldProcessor.processDocument

Now when reusing BytesRef in 4.0 (and reusing the char array in 3.5), Lucene 4 is >20%
faster than 3.5 for 5 million docs!
But at some point I had problems because a thread concurrently modified the docs - can this happen, e.g.
from the reopen thread? Or is it safe to reuse BytesRef?
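
To make the concern concrete, here is a small sketch of the general hazard with a reused mutable buffer. It uses a plain byte[] as a stand-in and does not use or describe Lucene's BytesRef or its internals; BufferReuseDemo and its methods are hypothetical names. The point is only that reuse is safe when the consumer copies the bytes before the producer overwrites them, and unsafe when it keeps the reference.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BufferReuseDemo {

    /** Overwrites the reused buffer with the ASCII bytes of id (zero padded, truncated to fit). */
    static void fill(byte[] buf, String id) {
        Arrays.fill(buf, (byte) 0);
        byte[] src = id.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, buf, 0, Math.min(src.length, buf.length));
    }

    /** Hypothetical consumer that keeps a reference to the caller's buffer: every entry aliases one array. */
    static List<byte[]> retainByReference(String[] ids) {
        byte[] reused = new byte[8];
        List<byte[]> kept = new ArrayList<>();
        for (String id : ids) {
            fill(reused, id);
            kept.add(reused); // no copy: later fills change what was "stored" earlier
        }
        return kept;
    }

    /** Hypothetical consumer that copies the bytes before the buffer is reused. */
    static List<byte[]> retainByCopy(String[] ids) {
        byte[] reused = new byte[8];
        List<byte[]> kept = new ArrayList<>();
        for (String id : ids) {
            fill(reused, id);
            kept.add(reused.clone()); // snapshot: safe against later reuse
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] ids = {"1", "2"};
        System.out.println("by reference, first entry: " + (char) retainByReference(ids).get(0)[0]);
        System.out.println("by copy, first entry: " + (char) retainByCopy(ids).get(0)[0]);
    }
}
```

With a second thread involved the same aliasing would show up as a race instead of a deterministic overwrite, which would match the kind of problem I saw.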

Regards,
Peter.




> Hi Simon,
>
> answers below.
>
>>> It does not seem to be an 'IO related issue' because using RAMDirectory
>>> results in the same times.
>>> And indexing via Luc4 with only one thread shouldn't be slower than 3.5 (?)
>> it could be since we use a different term dictionary impl which is
>> more expensive in building than the previous versions; thats just a
>> guess.
>> What I am really wondering is why you are using the NRT manager and
>> reopen during indexing - are you measuring the NRT reopen times too?
> My project requires reopening as it will then clear some caches.
>
> Reopening isn't that frequent (every 5 seconds). When disabling it the
> difference even increases slightly, but the big variation for luc4 goes
> away!
>
>
>> What merge policies are you using for 3x and 4x?
> The default ones. I'm now using LogByteSizeMergePolicy for both but it
> is nearly the same difference.
>
>
>>>> You should add some more randomness or reality to your test.
>>> Hmmh, ok. The uid and type is the reality in my other (experimental)
>>> project as it uses a generated and incremented id from AtomicLong and
>>> two types.
>>> Or do you have an explanation why luc4 can be slower on such 'simple'
>>> fields?
>> you reported that indexing only the ID is faster in 4.x but the other
>> fields AFAIK are likely always the same for all docs, no?
> no, the _uid field is different: it's the id field converted to string.
>
>
>> you are indexing with one thread right?
> yes.
>
>
>>  I mean my benchmarks show up
>> to 300% improvement with 4.x versus older versions so something is
>> weird ie. non-realistic here or there is a bug so lets figure this
>> out. Can you profile you app and see if you find something suspicious?
> I'll try now and report back.
>
>
>> I'd also try to index way more documents to make your benchmarks run
>> little longer just to be sure.
> For ~5 times more docs (5 mio) it is nearly the same difference.
>
>
> Regards,
> Peter.



