Date: Thu, 5 Jan 2012 09:21:17 +0100
Subject: Re: Comparing Indexing Speed of Lucene 3.5 and 4.0
From: Simon Willnauer <simon.willnauer@googlemail.com>
To: java-user@lucene.apache.org

hey peter,

On Wed, Jan 4, 2012 at 12:52 AM, Peter K <peathal@yahoo.de> wrote:
> Thanks Simon for your answer!
>
>> as far as I can see you are comparing apples and pears.
>
> When excluding the waiting time I also get the slight but reproducible
> difference**. The times for waitForGeneration are nearly the same
> (~2sec). Also, when I commit instead of calling waitForGeneration there
> is no difference. Would you mind giving me some more hints/explanations
> and I'll try to dig deeper :) !
>
>> Your comparison is waiting for merges to finish and if you are using
>> multiple threads lucene 4.0 will flush more segments to disk than 3.5
>
> It does not seem to be an 'IO related issue' because using RAMDirectory
> results in the same times.
> And indexing via Luc4 with only one thread shouldn't be slower than
> 3.5 (?)

it could be, since we use a different term dictionary impl which is
more expensive to build than the one in previous versions; that's just
a guess.
What I am really wondering is why you are using the NRT manager and
reopening during indexing - are you measuring the NRT reopen times too?
Maybe you can run your tests without NRT support, just plain indexing.
What merge policies are you using for 3.x and 4.x?


>
>
>> You should add some more randomness or reality to your test.
>
> Hmmh, ok. The uid and type is the reality in my other (experimental)
> project as it uses a generated and incremented id from AtomicLong and
> two types.
> Or do you have an explanation why luc4 can be slower on such 'simple'
> fields?

you reported that indexing only the ID is faster in 4.x, but the other
fields are, as far as I can tell, more or less the same for all docs,
no? maybe there is some weirdness where the term dict takes longer on
that kind of input?
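
To make the point about more varied input concrete, here is a minimal
sketch of a value generator that could feed the test documents. It is
plain Java with no Lucene calls; the class name TestDocValues, the second
type value "other" and the word lengths are all made up for illustration,
and the seed is only there so that runs stay reproducible.

```java
import java.util.Random;

/** Hypothetical helper: produces varied but reproducible field values for an indexing test. */
final class TestDocValues {
    // "test" comes from the original benchmark; "other" is an invented second type.
    private static final String[] TYPES = {"test", "other"};
    private final Random random;

    TestDocValues(long seed) {
        this.random = new Random(seed); // fixed seed => identical sequence on every run
    }

    /** Picks one of the type values instead of the same constant for every document. */
    String nextType() {
        return TYPES[random.nextInt(TYPES.length)];
    }

    /** Builds a text of wordCount random lowercase words, each 3 to 10 characters long. */
    String nextText(int wordCount) {
        StringBuilder sb = new StringBuilder();
        for (int w = 0; w < wordCount; w++) {
            if (w > 0) {
                sb.append(' ');
            }
            int len = 3 + random.nextInt(8); // 3..10 inclusive
            for (int c = 0; c < len; c++) {
                sb.append((char) ('a' + random.nextInt(26)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TestDocValues values = new TestDocValues(42L);
        for (int i = 0; i < 3; i++) {
            System.out.println(values.nextType() + " -> " + values.nextText(5));
        }
    }
}
```

The values from nextType() and nextText() would replace the constant
"test" and the "" + id strings when the fields are built; using the same
seed for the 3.5 and the 4.0 run keeps the two inputs identical.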

>
> Could it be due to some garbage collector or thread overhead with luc4?
> As I see a bigger execution speed variation for single lucene 4.0 runs
> (differences of seconds!) than for 3.5 (differences in 0.1seconds!).
> E.g. how could I try to reduce those/some threads?

you are indexing with one thread, right? My benchmarks show up to 300%
improvement with 4.x versus older versions, so either something here is
weird, i.e. non-realistic, or there is a bug, so let's figure this out.
Can you profile your app and see if you find something suspicious?
I'd also try to index way more documents to make your benchmarks run
a little longer, just to be sure.
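
As a rough illustration of longer, repeated runs, here is a minimal
timing harness sketch in plain Java. It is not taken from the benchmark
in this thread: the class name IndexBench, the warm-up and trial counts
and the stand-in workload are assumptions. It runs a task a few times
untimed, then times several trials and reports the median, the spread
between fastest and slowest trial, and the GC time accumulated meanwhile,
which may help tell GC noise apart from a real slowdown.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Arrays;

/** Hypothetical timing harness: untimed warm-up runs first, then several timed trials. */
final class IndexBench {

    /** Runs task 'warmups' times untimed, then 'trials' times timed; returns seconds per trial. */
    static double[] measure(Runnable task, int warmups, int trials) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // gives the JIT a chance to compile hot paths before timing
        }
        double[] seconds = new double[trials];
        for (int t = 0; t < trials; t++) {
            long start = System.nanoTime();
            task.run();
            seconds[t] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    /** Median of a non-empty array; the input is not modified. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Difference between the slowest and the fastest trial. */
    static double spread(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return max - min;
    }

    /** Total GC time in milliseconds reported by the JVM so far. */
    static long gcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if this collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in workload; the real test would index all documents and commit here.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 200_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("workload produced nothing");
            }
        };
        long gcBefore = gcMillis();
        double[] seconds = measure(task, 3, 7);
        System.out.printf("median=%.4fs spread=%.4fs gc=%dms%n",
                median(seconds), spread(seconds), gcMillis() - gcBefore);
    }
}
```

If the spread stays at several seconds for 4.0 even with more documents
and more trials, comparing the GC figure between the two versions would
be a cheap first check before reaching for a profiler.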

simon
>
> Regards,
> Peter.
>
>
>
> **
> sw = new StopWatch("perf" + trial).start();
> for (int i = 0; i < items; i++) {
>     innerRun(trial, i);
> }
> float indexingTime = sw.stop().getSeconds();
>
>
> // luc3.5
> @Override public void innerRun(int trial, int i) {
>     long id = i;
>     Document newDoc = new Document();
>     NumericField idField = new NumericField("_id", 6, Field.Store.YES,
>             true).setLongValue(id);
>     Field uIdField = new Field("_uid", "" + id, Field.Store.YES,
>             Field.Index.NOT_ANALYZED_NO_NORMS);
>     uIdField.setIndexOptions(IndexOptions.DOCS_ONLY);
>     Field typeField = new Field("_type", "test", Field.Store.YES,
>             Field.Index.NOT_ANALYZED_NO_NORMS);
>     typeField.setIndexOptions(IndexOptions.DOCS_ONLY);
>     newDoc.add(idField);
>     newDoc.add(uIdField);
>     newDoc.add(typeField);
>     try {
>         String longStr = NumericUtils.longToPrefixCoded(id);
>         latestGen = nrtManager.updateDocument(new Term("_id", longStr),
>                 newDoc);
>         docs++;
>     } catch (IOException ex) {
>         logger.error("Cannot update " + i, ex);
>     }
> }
>
>
> // luc4.0
> @Override public void innerRun(int trial, int i) {
>     long id = i;
>     Document newDoc = new Document();
>     NumericField idField = new NumericField("_id", 6,
>             NumericField.TYPE_STORED).setLongValue(id);
>     Field uIdField = new Field("_uid", "" + id, StringField.TYPE_STORED);
>     Field typeField = new Field("_type", "test", StringField.TYPE_STORED);
>     newDoc.add(idField);
>     newDoc.add(uIdField);
>     newDoc.add(typeField);
>     try {
>         // problem when reusing: nrt thread and this thread access the
>         // same bytes at the same time!
>         final BytesRef bytes = new BytesRef();
>         NumericUtils.longToPrefixCoded(id, 0, bytes);
>         latestGen = nrtManager.updateDocument(new Term("_id", bytes),
>                 newDoc);
>         docs++;
>     } catch (IOException ex) {
>         logger.error("Cannot update " + i, ex);
>     }
> }
>
>> hey Peter,
>>
>> as far as I can see you are comparing apples and pears. Your
>> comparison is waiting for merges to finish and if you are using
>> multiple threads lucene 4.0 will flush more segments to disk than 3.5
>> so what you are seeing is likely a merge that is still trying to merge
>> small segments. can you rerun and only measure the time until the last
>> commit finishes (not the close)
>>
>> one more thing, you are indexing always the more or less same document
>> and the text is very very short. You should add some more randomness
>> or reality to your test.
>>
>> simon
>>
>> On Tue, Jan 3, 2012 at 5:56 PM, Peter K <peathal@yahoo.de> wrote:
>>> Hi,
>>>
>>> I recently switched an experimental project from Lucene 3.5 to 4.0 from
>>> 6th Dec 2011
>>> and my indexing time increased by nearly 20% on my local machine*.
>>> It seems to me that two simple StringField's could cause this slow down:
>>> Field uIdField = new Field("_uid", "" + id, StringField.TYPE_STORED);
>>> Field typeField = new Field("_type", "test", StringField.TYPE_STORED);
>>>
>>> Without them Lucene 4 is faster**. Here is a recreation using different
>>> branches for every lucene version:
>>> https://github.com/karussell/lucene-tmp
>>> Or is there something wrong with my too simplistic scenario?
>>>
>>> Furthermore: How could I further improve Lucene 4.0 indexing speed?
>>> (I already read through the performance list on the wiki)
>>>
>>> Regards,
>>> Peter.
>>>
>>> *
>>> open jdk 1.6.0_20 (but also confirmed with latest java6 from oracle)
>>> ubuntu/10.10 linux/2.6.35-31 i686, 2GB ram
>>>
>>> **
>>> lucene 3.5
>>> 23.5sec index all three fields: _id, _uid, type
>>> 19.0sec index only the _id field
>>>
>>> lucene 4
>>> 29.5sec index _id, _uid, type
>>> 16.5sec index only the _id
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org