Subject: Re: Most efficient way to index 14M documents (out of memory/file handles)
From: Doug Cutting
To: Lucene Users List <lucene-user@jakarta.apache.org>
Date: Wed, 07 Jul 2004 09:54:29 -0700
Message-ID: <40EC2AC5.1010504@apache.org>
In-Reply-To: <40EB8DC8.9040007@newsmonster.org>

A mergeFactor of 5000 is a bad idea. If you want to index faster, try
increasing minMergeDocs instead. If you have lots of memory, this can
probably be 5000 or higher.

Also, why do you optimize before you're done? That only slows things
down. Perhaps you have to do it because you've set mergeFactor to such
an extreme value? I do not recommend a merge factor higher than 100.

Doug

Kevin A. Burton wrote:
> I'm trying to burn an index of 14M documents.
>
> I have two problems.
>
> 1. I have to run optimize() every 50k documents or I run out of file
> handles. This takes TIME and of course is linear in the size of the
> index, so it just gets slower as I go. It starts to crawl at about 3M
> documents.
>
> 2. I will eventually run out of memory in this configuration.
>
> I KNOW this has been covered before, but for the life of me I can't
> find it in the archives, the FAQ, or the wiki.
>
> I'm using an IndexWriter with a mergeFactor of 5k and then optimizing
> every 50k documents.
>
> Does it make sense to just create a new IndexWriter for every 50k docs
> and then do one big optimize() at the end?
>
> Kevin
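
A hedged illustration of the advice above: the sketch below uses one
long-lived IndexWriter with a modest mergeFactor, a raised minMergeDocs,
and a single optimize() at the very end, assuming the Lucene 1.4-era API
in which mergeFactor and minMergeDocs are public fields on IndexWriter.
The index path, field names, and the loop standing in for the real
document source are assumptions for illustration, not details from the
thread.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class BulkIndexer {
      public static void main(String[] args) throws Exception {
          // One writer for the whole run; create=true starts a fresh index
          // at a hypothetical path.
          IndexWriter writer = new IndexWriter("/tmp/bulk-index",
                                               new StandardAnalyzer(), true);

          // Keep mergeFactor modest (Doug suggests no more than 100) so the
          // number of simultaneously open segment files stays bounded.
          writer.mergeFactor = 10;

          // Buffer this many documents in RAM before flushing a segment to
          // disk; this is the knob to raise for speed, given enough memory.
          writer.minMergeDocs = 5000;

          // Add every document with no intermediate optimize() calls.
          for (int i = 0; i < 1000; i++) {
              Document doc = new Document();
              doc.add(Field.Keyword("id", Integer.toString(i)));
              doc.add(Field.Text("contents", "document body " + i));
              writer.addDocument(doc);
          }

          // A single optimize() at the end merges everything down once,
          // instead of paying that cost every 50k documents.
          writer.optimize();
          writer.close();
      }
  }

With this arrangement there is no need to open a new IndexWriter every
50k documents: file-handle pressure is governed by mergeFactor, and
memory use by minMergeDocs.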