Message-ID: <4737B134.1040002@gmail.com>
Date: Sun, 11 Nov 2007 20:49:40 -0500
From: Mark Miller
To: java-user@lucene.apache.org
Subject: Re: Optimizing index takes too long
References: <525c024b0711111516o255e13fyf017751419f5e184@mail.gmail.com>
 <5F1985B0-6F4C-4274-B188-521312C2E018@apache.org>
 <525c024b0711111605s1aa8ff56hbea8d739f1cbae4f@mail.gmail.com>
In-Reply-To: <525c024b0711111605s1aa8ff56hbea8d739f1cbae4f@mail.gmail.com>

For a start, I would lower the merge factor quite a bit. A high merge
factor is overrated :) You will build the index faster, but searches
will be slower and an optimize takes much longer. Essentially, the time
you save when indexing is paid back when optimizing anyway, so you might
as well amortize the cost with a lower merge factor.

Grant seems to think the numbers are off anyway, so you may have more to
do; this is just a suggestion about the merge factor.

How much RAM are you giving your application?

On a machine with 8 cores and 15,000 RPM disks, days does seem a little
ridiculous.

- Mark

Barry Forrest wrote:
> Hi,
>
> Thanks for your help.
>
> I'm using Lucene 2.3.
>
> Raw document size is about 138G for 1.5M documents, which is about
> 250k per document.
>
> IndexWriter settings are MergeFactor 50, MaxMergeDocs 2000,
> RAMBufferSizeMB 32, MaxFieldLength Integer.MAX_VALUE.
>
> Each document has about 10 short bibliographic fields, 3 longer
> content fields, and 1 field that contains the entire contents of the
> document. The longer content fields are indexed twice, in a stemmed
> and an unstemmed form, so there are actually about 8 longer content
> fields. (The effect of indexing both stemmed and unstemmed versions
> is to approximately double the index size over indexing the content
> only once.) About half the short bibliographic fields are stored
> (compressed) in the index. The longer content fields are not stored,
> and no term vectors are stored.
>
> The hardware is quite new and fast: 8 cores, 15,000 RPM disks.
>
> Thanks again
> Barry
>
> On Nov 12, 2007 10:41 AM, Grant Ingersoll wrote:
>
>> Hmmm, something doesn't sound quite right. You have 10 million docs,
>> split into 5 or so indexes, right? And each sub index is 150
>> gigabytes? How big are your documents?
>>
>> Can you provide more info about what your Directory and IndexWriter
>> settings are? What version of Lucene are you using? What are your
>> Field settings? Are you storing info? What about Term Vectors?
>>
>> Can you explain more about your documents, etc? 10 million doesn't
>> sound like it would need to be split up that much, if at all,
>> depending on your hardware.
>>
>> The wiki has some excellent resources on improving both indexing and
>> search speed.
>>
>> -Grant
>>
>> On Nov 11, 2007, at 6:16 PM, Barry Forrest wrote:
>>
>>> Hi,
>>>
>>> Optimizing my index of 1.5 million documents takes days and days.
>>>
>>> I have a collection of 10 million documents that I am trying to index
>>> with Lucene. I've divided the collection into chunks of about 1.5 - 2
>>> million documents each. Indexing 1.5 million documents is fast enough
>>> (about 12 hours), but this results in an index directory containing
>>> about 35000 files. Optimizing this index takes several days, which is
>>> a bit too long for my purposes. Each sub-index is about 150G.
>>>
>>> What can I do to make this process faster?
>>>
>>> Thanks for your help,
>>> Barry
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
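
Mark's point that time saved at indexing is paid back later can be
illustrated with a toy model. The sketch below is not Lucene code and
does not reproduce Lucene's actual merge policy. It assumes a simple
scheme in which every flush writes one unit-sized segment and, whenever
mergeFactor segments of the same level exist, they are merged into one
segment of the next level. The class and method names
(MergeFactorSketch, simulate) are hypothetical, and MaxMergeDocs is
ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of level-based segment merging. NOT Lucene's real merge policy.
final class MergeFactorSketch {

    /** Outcome of one simulated indexing run. */
    static final class Result {
        final long unitsCopied;  // flushed-buffer units rewritten by merges while indexing
        final int segmentsLeft;  // segments remaining when indexing finishes

        Result(long unitsCopied, int segmentsLeft) {
            this.unitsCopied = unitsCopied;
            this.segmentsLeft = segmentsLeft;
        }
    }

    /** Simulates the given number of flushes under the given merge factor. */
    static Result simulate(int flushes, int mergeFactor) {
        if (flushes < 0 || mergeFactor < 2) {
            throw new IllegalArgumentException("need flushes >= 0 and mergeFactor >= 2");
        }
        // countsPerLevel.get(k) = number of segments holding mergeFactor^k units each.
        List<Integer> countsPerLevel = new ArrayList<>();
        long unitsCopied = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            long segmentSize = 1; // size of a segment at the current level
            increment(countsPerLevel, level);
            // Cascade: a full level is merged into one segment of the next level.
            while (countsPerLevel.get(level) == mergeFactor) {
                countsPerLevel.set(level, 0);
                segmentSize *= mergeFactor;  // size of the merged segment
                unitsCopied += segmentSize;  // the merge rewrites all of its input
                level++;
                increment(countsPerLevel, level);
            }
        }
        int segmentsLeft = 0;
        for (int count : countsPerLevel) {
            segmentsLeft += count;
        }
        return new Result(unitsCopied, segmentsLeft);
    }

    private static void increment(List<Integer> counts, int level) {
        if (level == counts.size()) {
            counts.add(0);
        }
        counts.set(level, counts.get(level) + 1);
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {10, 50}) {
            Result r = simulate(1000, mergeFactor);
            System.out.println("mergeFactor=" + mergeFactor
                    + " unitsCopied=" + r.unitsCopied
                    + " segmentsLeft=" + r.segmentsLeft);
        }
    }
}
```

In this model, 1000 flushes with a merge factor of 10 rewrite 3000
units during indexing and end with a single segment, while a merge
factor of 50 rewrites only 1000 units but leaves 20 segments for
searches and a later optimize to deal with. Real numbers depend on the
actual merge policy, MaxMergeDocs, and the hardware.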