From: Andrzej Bialecki
Date: Wed, 07 Jul 2004 11:57:53 +0200
To: Lucene Users List <lucene-user@jakarta.apache.org>
Subject: Re: Most efficient way to index 14M documents (out of memory/file handles)

markharw00d@yahoo.co.uk wrote:

> A colleague of mine found the fastest way to index was to use a RAMDirectory, letting it grow to a
> pre-defined maximum size, then merging it to a new temporary file-based index to
> flush it. Repeat this, creating new directories for all the file-based indexes, then perform
> a merge into one index once all docs are indexed.
>
> I haven't managed to test this for myself, but my colleague says he noticed a
> considerable speed-up by merging once at the end with this approach, so you may want
> to give it a try. (This was with Lucene 1.3)

I can confirm that this approach works quite well - I use it myself in some
applications, both with Lucene 1.3 and 1.4. The disadvantage is of course that
memory consumption goes up, so you have to be careful to cap the maximum size of
the RAMDirectory according to your max heap size limits.

--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
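For readers of the archive: the batching strategy described above can be sketched roughly as follows. This is a minimal illustration, not code from the thread - the class name, batch size, paths, and the fetchDocuments() helper are all hypothetical, while the Lucene calls (RAMDirectory, FSDirectory.getDirectory, IndexWriter.addIndexes, optimize) match the 1.3/1.4-era API the posters were using. Note it mixes that old API with modern Java syntax for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class BatchedIndexer {

    // Cap the in-memory batch by document count; tune this against your
    // max heap, as the reply above warns. (Illustrative value.)
    private static final int BATCH_SIZE = 50000;

    public static void main(String[] args) throws Exception {
        List<Directory> segments = new ArrayList<Directory>();
        RAMDirectory ram = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
        int inBatch = 0;
        int batchNo = 0;

        for (Document doc : fetchDocuments()) {   // fetchDocuments() is app-specific
            ramWriter.addDocument(doc);
            if (++inBatch >= BATCH_SIZE) {
                // Batch full: flush the RAMDirectory to its own temporary
                // on-disk index and start a fresh in-memory buffer.
                ramWriter.close();
                segments.add(flushToDisk(ram, batchNo++));
                ram = new RAMDirectory();
                ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
                inBatch = 0;
            }
        }
        ramWriter.close();
        if (inBatch > 0) {
            segments.add(flushToDisk(ram, batchNo));
        }

        // The single merge at the end, which is where the speed-up was seen.
        IndexWriter finalWriter = new IndexWriter(
                FSDirectory.getDirectory("/tmp/final-index", true),
                new StandardAnalyzer(), true);
        finalWriter.addIndexes(segments.toArray(new Directory[0]));
        finalWriter.optimize();
        finalWriter.close();
    }

    // Merge one in-memory batch into its own temporary file-based index.
    private static Directory flushToDisk(RAMDirectory ram, int n) throws Exception {
        Directory disk = FSDirectory.getDirectory("/tmp/batch-" + n, true);
        IndexWriter w = new IndexWriter(disk, new StandardAnalyzer(), true);
        w.addIndexes(new Directory[] { ram });
        w.close();
        return disk;
    }

    // Placeholder for the application's document source.
    private static Iterable<Document> fetchDocuments() {
        return Collections.emptyList();
    }
}
```

The point of the final addIndexes() call over all the temporary directories is that Lucene performs one large merge instead of repeatedly rewriting a growing on-disk index, which also keeps the number of open file handles per batch bounded.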