Subject: Speed of indexing
Date: Mon, 25 Mar 2002 08:08:42 -0500
Message-ID: <F9045D91D27355468F4DAEA01860F07A07521C@ULYSSES.lingomotors.com>
From: "David Elworthy" <dahe@lingomotors.com>
To: <lucene-user@jakarta.apache.org>

I was wondering if there are tricks for making indexing faster in
Lucene. I have a program which reads XML "documents" from a file, and
indexes the 7 or so fields which occur in them. Most of the fields are
very short, and the one long one averages a few hundred words.

To index 20000 such records takes 615 seconds. I use an IndexWriter with
a String as the first argument, i.e. indexing directly to disc. If I
change the mergeFactor to 100, the time drops to 275 seconds. At 1000,
it drops to 249s. These times are not bad in absolute terms, but the
20000 records represent only about 2% of my data, so indexing the whole
lot takes many hours. Using java -Xprof and mergeFactor=10, the biggest
consumers of processing time are:
 22.2%     5  + 13172    java.io.RandomAccessFile.open
 16.1%     4  +  9567    java.io.RandomAccessFile.close
 13.3%     4  +  7880    java.io.RandomAccessFile.readBytes
  8.1%     5  +  4818    java.io.RandomAccessFile.writeBytes
  7.2%  4293  +     9    org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
  5.8%     5  +  3426    java.io.Win32FileSystem.delete

I believe all of these are calls from Lucene as I don't use any of the
above methods in my own code. readBytes and writeBytes I can believe,
but why so much time on open and close? Incidentally, with
mergeFactor=1000, the biggest consumers are:
 29.7%     0  +  6729    java.io.RandomAccessFile.readBytes
 19.0%  4296  +    12    org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
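
One possible explanation for the open/close figures, offered as a rough
model rather than a description of Lucene's actual merge code: assume
every added document starts out as its own small segment, and whenever
mergeFactor segments of the same level exist they are merged into one
segment of the next level. Each merge has to open, close and delete
several files, so the number of merges is a plausible proxy for that
overhead. The class and method names below are hypothetical; the sketch
only counts merges and does not touch Lucene or the disk.

```java
public class MergeCountSketch {

    /**
     * Counts merges under a simplified, assumed model: each document is a
     * level-0 segment, and every group of mergeFactor segments at one level
     * is merged into a single segment at the next level. Segments left over
     * at a level (fewer than mergeFactor) stay unmerged.
     */
    public static long countMerges(long numDocs, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        long merges = 0;
        long segmentsAtLevel = numDocs;
        while (segmentsAtLevel >= mergeFactor) {
            long mergesAtLevel = segmentsAtLevel / mergeFactor;
            merges += mergesAtLevel;
            // Each merge produces exactly one segment at the next level.
            segmentsAtLevel = mergesAtLevel;
        }
        return merges;
    }

    public static void main(String[] args) {
        long numDocs = 20000; // record count from the timings above
        for (int mergeFactor : new int[] {10, 100, 1000}) {
            System.out.println("mergeFactor=" + mergeFactor
                    + " -> " + countMerges(numDocs, mergeFactor) + " merges");
        }
    }
}
```

Under this model, 20000 documents give 2222 merges at mergeFactor=10,
202 at 100 and 20 at 1000. That is at least consistent with open and
close no longer appearing among the top consumers listed for
mergeFactor=1000, though the model says nothing about how much data
each merge reads and writes.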


As a point of comparison, I tried AltaVista's Java SDK (Nov 2000
release). I have a generic indexer program which differs only in the
specific indexing calls for AV and Lucene. For the same 20000 records,
it took only 57 seconds. This, I feel, does not speak well to Doug's
comment in the Lucene FAQ that indexing in Lucene is very fast. If
anyone has ideas for making it faster, I'd be interested to hear them.
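
To put the timings side by side, here is a small extrapolation from the
20000-record sample to the full data set. It assumes the sample is
exactly 2% of the data (the text only says "about 2%") and that
indexing time scales linearly with the number of records, which may
well be optimistic once merges of large segments are involved. The
class and method names are hypothetical.

```java
public class IndexTimeEstimate {

    // Assumption: the 20000-record sample is 2% of the full data set.
    static final double SAMPLE_PERCENT = 2.0;

    /** Scales a measured sample time linearly to the full data set, in hours. */
    public static double estimateFullHours(double sampleSeconds) {
        double fullSeconds = sampleSeconds * 100.0 / SAMPLE_PERCENT;
        return fullSeconds / 3600.0;
    }

    public static void main(String[] args) {
        String[] labels = {
            "Lucene, initial run",
            "Lucene, mergeFactor=100",
            "Lucene, mergeFactor=1000",
            "AltaVista SDK"
        };
        double[] sampleSeconds = {615, 275, 249, 57}; // timings quoted above
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-26s %6.2f hours%n",
                    labels[i], estimateFullHours(sampleSeconds[i]));
        }
    }
}
```

With these assumptions the estimates come out at roughly 8.5 hours for
the initial Lucene run, 3.8 hours at mergeFactor=100, 3.5 hours at
mergeFactor=1000, and 0.8 hours for the AltaVista SDK.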

-- David Elworthy
