From: "David Elworthy"
To: "Lucene Users List" (lucene-user@jakarta.apache.org)
Subject: Speed of indexing
Date: Mon, 25 Mar 2002 08:08:42 -0500

I was wondering if there are tricks for making indexing faster in Lucene. I have a program which reads XML "documents" from a file, and indexes the 7 or so fields which occur in them. Most of the fields are very short, and the one long one averages a few hundred words. To index 20000 such records takes 615 seconds. I use an IndexWriter with a String as the first argument, i.e. indexing directly to disc. If I change the mergeFactor to 100, the time drops to 275 seconds. At 1000, it drops to 249s.

These times are not bad in absolute terms, but the 20000 records represent only about 2% of my data, so indexing the whole lot takes many hours.
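To put "many hours" into numbers, the sample timings above can be scaled up to the full data set. The sketch below is only a rough estimate: it assumes indexing time grows linearly with record count (which the message does not claim), it treats "about 2%" as exactly 0.02, and the class and method names are made up for illustration.

```java
import java.util.Locale;

public class IndexingEstimate {
    // Assumption: the 20000-record sample is exactly 2% of the data.
    static final double SAMPLE_FRACTION = 0.02;

    /** Scales a sample timing linearly to the full data set and returns hours. */
    static double fullRunHours(double sampleSeconds, double sampleFraction) {
        return sampleSeconds / sampleFraction / 3600.0;
    }

    public static void main(String[] args) {
        // Timings for 20000 records, as reported in the message.
        String[] mergeFactor = {"default", "100", "1000"};
        double[] sampleSeconds = {615, 275, 249};

        for (int i = 0; i < sampleSeconds.length; i++) {
            double hours = fullRunHours(sampleSeconds[i], SAMPLE_FRACTION);
            System.out.printf(Locale.ROOT, "mergeFactor %-8s %4.0f s per sample -> %4.1f h total%n",
                    mergeFactor[i], sampleSeconds[i], hours);
        }
    }
}
```

Under these assumptions the full run comes to roughly 8.5 hours with the default setting, 3.8 hours at mergeFactor 100 and 3.5 hours at 1000. The real figures could be worse if merge cost grows faster than linearly as the index gets larger.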
Using java -Xprof and mergeFactor=10, the biggest consumers of processing time are:

  22.2%     5 + 13172   java.io.RandomAccessFile.open
  16.1%     4 +  9567   java.io.RandomAccessFile.close
  13.3%     4 +  7880   java.io.RandomAccessFile.readBytes
   8.1%     5 +  4818   java.io.RandomAccessFile.writeBytes
   7.2%  4293 +     9   org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
   5.8%     5 +  3426   java.io.Win32FileSystem.delete

I believe all of these are calls from Lucene as I don't use any of the above methods in my own code. readBytes and writeBytes I can believe, but why so much time on open and close?

Incidentally with mergeFactor=1000, the biggest consumers are

  29.7%     0 +  6729   java.io.RandomAccessFile.readBytes
  19.0%  4296 +    12   org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0

As a point of comparison, I tried AltaVista's Java SDK (Nov 2000 release). I have a generic indexer program which differs only in the specific indexing calls for AV and Lucene. For the same 20000 records, it took only 57 seconds. This, I feel, does not speak well to Doug's comment in the Lucene FAQ that indexing in Lucene is very fast.

If anyone has ideas for making it faster, I'd be interested to hear them.

--
David Elworthy
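The mergeFactor=10 profile is easier to read once the entries are grouped. The sketch below only adds up the percentages quoted in the message; the grouping into "file handling" and "data transfer" and the class name are illustrative assumptions, not something the profiler reports.

```java
import java.util.Locale;

public class ProfileShares {
    /** Sums the percentage shares of a group of profile entries. */
    static double sum(double... shares) {
        double total = 0.0;
        for (double s : shares) {
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        // -Xprof percentages at mergeFactor=10, copied from the message.
        double open = 22.2, close = 16.1, delete = 5.8;
        double readBytes = 13.3, writeBytes = 8.1;
        double tokenizer = 7.2;

        // Assumed grouping: calls that manage files without moving any data.
        double fileHandling = sum(open, close, delete);
        // Assumed grouping: calls that actually read or write index bytes.
        double dataTransfer = sum(readBytes, writeBytes);

        System.out.printf(Locale.ROOT, "open + close + delete: %.1f%%%n", fileHandling);
        System.out.printf(Locale.ROOT, "readBytes + writeBytes: %.1f%%%n", dataTransfer);
        System.out.printf(Locale.ROOT, "tokenizer:              %.1f%%%n", tokenizer);
    }
}
```

On these figures, opening, closing and deleting files accounts for about 44% of the sampled time, roughly twice the 21% spent reading and writing bytes. Those three entries do not appear among the top consumers in the mergeFactor=1000 profile.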