Message-ID: <359a92830705030730y35348ff6nbe4b30be63621c30@mail.gmail.com>
Date: Thu, 3 May 2007 10:30:51 -0400
From: "Erick Erickson"
To: java-user@lucene.apache.org, aleksander.stensby@integrasco.no
Subject: Re: MergeFactor advice wanted

I don't think (but don't know for sure) that optimizing before the end
of the run buys you anything. And you're right, it takes a while. I've
assumed that it was best done at the end of the entire run, but that's
only an assumption.

Search the archives for the thread titled
"MergeFactor and MaxBufferedDocs value should ...?"
for an exposition on how all the indexing factors relate.

Also, look at the (new to 2.1) IndexWriter.ramSizeInBytes() (or something
like that). Rather than worrying about MergeFactor and the other
parameters and guessing, this may allow you to flush RAM to disk when
needed rather than every N documents. CAUTION: there's a bug (detailed
in the thread referenced above) with this code, so look at the thread...

In fact, yesterday I experimented with what I call an "AdaptiveWriter".
The idea is to query the operating system for the amount of RAM I'm
using, and track the amount of RAM used to index a document as it
relates to the document's size. Then flush when I get "too close for
comfort" to an OOM error.

Yes, this is a digression from optimizing, but it is related to indexing
as fast as possible. By monitoring how much the index grows and the size
of the incoming document, I can create a crude measure of how much RAM
the index needs for a document of a given size. Actually, I tracked the
ratio of the size of incoming doc to the change in memory. When my
available RAM for the process is less than 2X the largest ratio yet
times the size of the incoming document, I flush.

I'm not sure how much this changes things, but I thought that creating
one of these was better than experimenting with the various factors for
each new project...

Anyway, do look at that thread for ideas on how to make this as
efficient as possible, and you can probably ignore the rest ...

Best
Erick

On 5/3/07, Aleksander M. Stensby wrote:
>
> Ok. but then you would not optimize at all? Not even in the end of the
> indexing run?
>
> On Thu, 03 May 2007 12:17:40 +0200, Mark Miller wrote:
>
> > I think it is worth your time to do some benchmarking. I think
> > mergeFactor is not very helpful in the end...if you set it high, you'll
> > index faster but then your searches will be slower prompting you to
> > optimize...after which you'll find that you paid all your gains back.
> > Test things out for yourself, but I'd recommend a low merge factor and
> > then you can forget about the hassle of optimizing. Amortize,
> > amortize...
> >
> > - Mark
> >
> > Aleksander M. Stensby wrote:
> >> Hello everyone!
> >> I'm wondering if any of you have any helpful advice to what MergeFactor
> >> i should use...
> >> The indexing process is handling a large amount of documents and i
> >> would like to index as fast as possible.
> >> Initial thought was to increase the mergeFactor to make the indexer
> >> work more in memory and less writing to file. Thus this created a
> >> problem for me with "TOO-MANY-OPEN-FILES"... of course, since i choose
> >> 2000 as my mergeFactor:) Well, i could do an optimize from time to
> >> time, but the big question is whats more efficient? Optimize tends to
> >> take a loooong time on our system since it is quite a large index.
> >>
> >> Any helpful advice to what i should do? 10 as mergeFactor cant possibly
> >> be the best solution here? Any advice would be highly appreciated!
> >>
> >> - Aleksander
> >>
> >> --Aleksander M. Stensby
> >> Software Developer
> >> Integrasco A/S
> >> aleksander.stensby@integrasco.no
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
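
The flush rule from the "AdaptiveWriter" paragraphs above could be sketched roughly as follows. This is only an illustration, not the actual code from that experiment and not Lucene API: the class AdaptiveFlushPolicy, its method names, and the free-heap estimate via Runtime are all made up for the sketch. It also assumes the tracked ratio is memory growth divided by document size, so that multiplying it by the size of the incoming document gives a memory estimate. Measuring the growth and doing the actual flush are left to the caller.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration only; not part of Lucene. */
class AdaptiveFlushPolicy {
    // Source of "RAM still available to the process", in bytes (assumption:
    // the caller decides how to measure this, e.g. freeHeapBytes() below).
    private final LongSupplier availableBytes;
    // Safety multiplier; the message above uses 2X.
    private final double safetyFactor;
    // Largest observed (memory growth / document size) ratio so far.
    private double maxGrowthPerDocByte = 0.0;

    AdaptiveFlushPolicy(LongSupplier availableBytes, double safetyFactor) {
        this.availableBytes = availableBytes;
        this.safetyFactor = safetyFactor;
    }

    /** Record how much memory grew after indexing a document of the given size. */
    void record(long docSizeBytes, long memoryGrowthBytes) {
        if (docSizeBytes <= 0 || memoryGrowthBytes <= 0) {
            return; // ignore empty documents and noisy (e.g. post-GC) samples
        }
        double ratio = (double) memoryGrowthBytes / docSizeBytes;
        if (ratio > maxGrowthPerDocByte) {
            maxGrowthPerDocByte = ratio;
        }
    }

    /** True if available RAM is below safetyFactor * largest ratio * doc size. */
    boolean shouldFlushBefore(long docSizeBytes) {
        double estimatedNeed = safetyFactor * maxGrowthPerDocByte * docSizeBytes;
        return availableBytes.getAsLong() < estimatedNeed;
    }

    /** Rough estimate of heap the JVM could still hand out (approximate only). */
    static long freeHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }
}
```

For example, with 1000 bytes available and a largest observed ratio of 3 (300 bytes of growth for a 100-byte document), a 200-byte document would trigger a flush (2 x 3 x 200 = 1200 > 1000), while a 100-byte document would not (600 < 1000). In real use one would presumably pass AdaptiveFlushPolicy::freeHeapBytes as the supplier, keeping in mind that JVM heap figures are approximate.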