From: Simon Willnauer
To: java-user@lucene.apache.org
Subject: Re: Comparing Indexing Speed of Lucene 3.5 and 4.0
Date: Thu, 5 Jan 2012 09:21:17 +0100

Hey Peter,

On Wed, Jan 4, 2012 at 12:52 AM, Peter K wrote:
> Thanks Simon for your answer!
>
>> as far as I can see you are comparing apples and pears.
>
> When excluding the waiting time I also get the slight but reproducible
> difference**. The times for waitForGeneration are nearly the same
> (~2sec). Also, when I commit instead of calling waitForGeneration there
> is no difference. Would you mind giving me some more hints/explanations,
> and I'll try to dig deeper :) !
>
>> Your comparison is waiting for merges to finish and if you are using
>> multiple threads lucene 4.0 will flush more segments to disk than 3.5
>
> It does not seem to be an 'IO related issue' because using RAMDirectory
> results in the same times.
> And indexing via Luc4 with only one thread shouldn't be slower than 3.5 (?)

It could be, since we use a different term dictionary implementation
which is more expensive to build than the one in previous versions;
that's just a guess. What I am really wondering is why you are using the
NRT manager and reopening during indexing. Are you measuring the NRT
reopen times too? Maybe you can run your tests without NRT support, just
plain indexing. Which merge policies are you using for 3.x and 4.x?

>
>
>> You should add some more randomness or reality to your test.
>
> Hmmh, ok. The uid and type is the reality in my other (experimental)
> project as it uses a generated and incremented id from AtomicLong and
> two types.
> Or do you have an explanation why luc4 can be slower on such 'simple'
> fields?

You reported that indexing only the ID is faster in 4.x, but the other
fields AFAIK are likely always the same for all docs, no? Maybe there is
some weirdness where the term dict takes longer on that kind of input?

>
> Could it be due to some garbage collector or thread overhead with luc4?
> As I see a bigger execution speed variation for single lucene 4.0 runs
> (differences of seconds!) than for 3.5 (differences in 0.1seconds!).
> E.g. how could I try to reduce those/some threads?

You are indexing with one thread, right? My benchmarks show up to 300%
improvement with 4.x versus older versions, so either something is weird
(i.e. non-realistic) here or there is a bug. Let's figure this out. Can
you profile your app and see if you find something suspicious? I'd also
try to index way more documents to make your benchmarks run a little
longer, just to be sure.

simon

>
> Regards,
> Peter.
>
>
>
> **
> sw = new StopWatch("perf" + trial).start();
> for (int i = 0; i < items; i++) {
>     innerRun(trial, i);
> }
> float indexingTime = sw.stop().getSeconds();
>
>
> // luc3.5
> @Override public void innerRun(int trial, int i) {
>     long id = i;
>     Document newDoc = new Document();
>     NumericField idField = new NumericField("_id", 6, Field.Store.YES,
>             true).setLongValue(id);
>     Field uIdField = new Field("_uid", "" + id, Field.Store.YES,
>             Field.Index.NOT_ANALYZED_NO_NORMS);
>     uIdField.setIndexOptions(IndexOptions.DOCS_ONLY);
>     Field typeField = new Field("_type", "test", Field.Store.YES,
>             Field.Index.NOT_ANALYZED_NO_NORMS);
>     typeField.setIndexOptions(IndexOptions.DOCS_ONLY);
>     newDoc.add(idField);
>     newDoc.add(uIdField);
>     newDoc.add(typeField);
>     try {
>         String longStr = NumericUtils.longToPrefixCoded(id);
>         latestGen = nrtManager.updateDocument(new Term("_id", longStr),
>                 newDoc);
>         docs++;
>     } catch (IOException ex) {
>         logger.error("Cannot update " + i, ex);
>     }
> }
>
>
> // luc4.0
> @Override public void innerRun(int trial, int i) {
>     long id = i;
>     Document newDoc = new Document();
>     NumericField idField = new NumericField("_id", 6,
>             NumericField.TYPE_STORED).setLongValue(id);
>     Field uIdField = new Field("_uid", "" + id, StringField.TYPE_STORED);
>     Field typeField = new Field("_type", "test", StringField.TYPE_STORED);
>     newDoc.add(idField);
>     newDoc.add(uIdField);
>     newDoc.add(typeField);
>     try {
>         // problem when reusing: nrt thread and this thread access the
>         // same bytes at the same time!
>         final BytesRef bytes = new BytesRef();
>         NumericUtils.longToPrefixCoded(id, 0, bytes);
>         latestGen = nrtManager.updateDocument(new Term("_id", bytes),
>                 newDoc);
>         docs++;
>     } catch (IOException ex) {
>         logger.error("Cannot update " + i, ex);
>     }
> }
>
>> hey Peter,
>>
>> as far as I can see you are comparing apples and pears. Your
>> comparison is waiting for merges to finish and if you are using
>> multiple threads lucene 4.0 will flush more segments to disk than 3.5
>> so what you are seeing is likely a merge that is still trying to merge
>> small segments. can you rerun and only measure the time until the last
>> commit finishes (not the close)
>>
>> one more thing, you are indexing always the more or less same document
>> and the text is very very short. You should add some more randomness
>> or reality to your test.
>>
>> simon
>>
>> On Tue, Jan 3, 2012 at 5:56 PM, Peter K wrote:
>>> Hi,
>>>
>>> I recently switched an experimental project from Lucene 3.5 to 4.0
>>> (snapshot from 6th Dec 2011)
>>> and my indexing time increased by nearly 20% on my local machine*.
>>> It seems to me that two simple StringFields could cause this slowdown:
>>> Field uIdField = new Field("_uid", "" + id, StringField.TYPE_STORED);
>>> Field typeField = new Field("_type", "test", StringField.TYPE_STORED);
>>>
>>> Without them Lucene 4 is faster**. Here is a recreation using different
>>> branches for every lucene version:
>>> https://github.com/karussell/lucene-tmp
>>> Or is there something wrong with my too simplistic scenario?
>>>
>>> Furthermore: How could I further improve Lucene 4.0 indexing speed?
>>> (I already read through the performance list on the wiki)
>>>
>>> Regards,
>>> Peter.
>>>
>>> *
>>> open jdk 1.6.0_20 (but also confirmed with latest java6 from oracle)
>>> ubuntu/10.10 linux/2.6.35-31 i686, 2GB ram
>>>
>>> **
>>> lucene 3.5
>>> 23.5sec index all three fields: _id, _uid, type
>>> 19.0sec index only the _id field
>>>
>>> lucene 4
>>> 29.5sec index _id, _uid, type
>>> 16.5sec index only the _id
>>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
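
The suggestions in this thread (index many more documents, make the
documents less uniform, and look at the run-to-run variation rather than
a single timing) could be combined in a small timing harness. What
follows is only a sketch under stated assumptions, using nothing but the
JDK: BenchSketch, timeTrials and randomText are made-up names, not Lucene
APIs, and the lambda passed to timeTrials is a stand-in for whatever
actually indexes one document (for example the innerRun method quoted
above).

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.IntConsumer;

/** Hypothetical benchmark helper; none of these names come from Lucene. */
final class BenchSketch {

    /** Builds a pseudo-random "word word ..." string so documents differ. */
    static String randomText(Random rnd, int words) {
        StringBuilder sb = new StringBuilder();
        for (int w = 0; w < words; w++) {
            if (w > 0) sb.append(' ');
            int len = 3 + rnd.nextInt(8); // word length between 3 and 10
            for (int c = 0; c < len; c++) {
                sb.append((char) ('a' + rnd.nextInt(26)));
            }
        }
        return sb.toString();
    }

    /**
     * Runs `warmups` untimed trials (to let the JIT settle), then `trials`
     * timed ones. Returns the wall-clock seconds of each timed trial.
     */
    static double[] timeTrials(int warmups, int trials, int items,
                               IntConsumer indexOne) {
        for (int t = 0; t < warmups; t++) {
            for (int i = 0; i < items; i++) indexOne.accept(i);
        }
        double[] seconds = new double[trials];
        for (int t = 0; t < trials; t++) {
            long start = System.nanoTime();
            for (int i = 0; i < items; i++) indexOne.accept(i);
            seconds[t] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Sample standard deviation; 0.0 when fewer than two values. */
    static double stdDev(double[] xs) {
        if (xs.length < 2) return 0.0;
        double m = mean(xs);
        double sumSq = 0;
        for (double x : xs) sumSq += (x - m) * (x - m);
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed: same documents every run
        long[] sink = new long[1];   // keeps the JIT from dropping the work
        // Placeholder workload; a real test would index a document here.
        double[] s = timeTrials(2, 5, 100_000,
                i -> sink[0] += randomText(rnd, 5).length());
        System.out.printf("mean=%.3fs stddev=%.3fs (sink=%d)%n",
                mean(s), stdDev(s), sink[0]);
    }
}
```

Reporting the mean together with the standard deviation over several
trials would make it easier to tell whether a difference between two
Lucene versions is larger than the noise between individual runs. The
fixed seed is a design choice so that both versions see the same
generated documents.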