From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: MergeFactor and MaxBufferedDocs value should ...?
Date: Sun, 25 Mar 2007 11:07:32 -0400

I should add that in my situation, the number of documents that fit in RAM is... er... problematic to determine. My current project is composed of books, which I chose to index one book at a time. Unfortunately, answering the question "how big is a book?" doesn't help much; they range from 2 pages to over 7,000 pages. So setting the various indexing parameters, especially maxBufferedDocs, is a hard balance between efficiency and memory. Will I happen to get a string of 100 large books? If so, I need to set the limit to a small number, which will not be terribly efficient for the "usual" case.

That said, I don't much care about efficiency in this case.
I can't generate the index quickly (20,000+ books), and the factors I've chosen let me generate it between the time I leave work and the time I get back in the morning, so I don't really need much more tweaking. But this illustrates why I referred to picking factors as a "guess". With a heterogeneous index where the documents vary widely in size, picking parameters isn't straightforward. My current parameters may not work if I index the documents in a different order than I do currently. I just don't know. They may not even work on the next set of data, since much of the data is OCR, and for many books it's pretty trashy and/or incomplete (imagine the OCR output of a genealogy book that consists entirely of a stylized tree with the names written by hand along the branches in many orientations!). We're promised much better OCR data in the next set of books we index, which may blow my current indexer out of the water.

Which is why I'm so glad that ramSizeInBytes has been added. It seems to me that I can now create a reasonably generalized way to index heterogeneous documents with "good enough" efficiency. I'm imagining keeping a few simple statistics, like the size of each incoming document and the change in index size as a result of indexing that doc. This should allow me to figure out a reasonable factor for predicting how much the *next* addition will increase the index, and to flush RAM based on that prediction. With, probably, quite a large safety margin. I don't really care about getting every last bit of efficiency in this case. What I *do* care about is that the indexing run completes, and this new capability seems to allow me to ensure that without penalizing the bulk of my indexing because of a few edge cases.

Anyway, thanks for adding this capability, which I'll probably use in the pretty near future. And thanks, Michael, for your explanation of what these factors really do. It may have been documented before, but this time it's finally sticking in my aging brain...
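The bookkeeping described above could be sketched roughly like this. Everything here is illustrative (the class and method names are made up, not part of Lucene); the idea is just to track the worst observed ratio of index-RAM growth to raw document size, and use it, with a safety margin, to predict whether the next document would blow a RAM budget:

```java
// Hypothetical helper, not a Lucene class: records how much the writer's
// RAM grew per byte of each indexed document, and predicts whether the
// next document would push past a RAM budget.
public class FlushPredictor {
    private final long ramBudgetBytes;  // RAM ceiling we want to stay under
    private final double safetyMargin;  // multiply predictions to be conservative
    private double worstRatio = 1.0;    // worst observed (RAM growth / doc bytes)

    public FlushPredictor(long ramBudgetBytes, double safetyMargin) {
        this.ramBudgetBytes = ramBudgetBytes;
        this.safetyMargin = safetyMargin;
    }

    /** Record one indexed document: its raw size and how much writer RAM grew. */
    public void record(long docBytes, long ramGrowthBytes) {
        if (docBytes > 0) {
            worstRatio = Math.max(worstRatio, (double) ramGrowthBytes / docBytes);
        }
    }

    /** Should we flush the writer's RAM before adding a doc of this size? */
    public boolean shouldFlushBefore(long nextDocBytes, long currentRamBytes) {
        double predictedGrowth = nextDocBytes * worstRatio * safetyMargin;
        return currentRamBytes + predictedGrowth > ramBudgetBytes;
    }
}
```

In use, currentRamBytes would come from IndexWriter.ramSizeInBytes() before each addDocument call, and the RAM growth from the difference in that value across the call.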
Erick

On 3/23/07, Michael McCandless wrote:
>
> "Erick Erickson" wrote:
> > I haven't used it yet, but I've seen several references to
> > IndexWriter.ramSizeInBytes() and using it to control when the writer
> > flushes the RAM. This seems like a more deterministic way of
> > making things efficient than trying various combinations of
> > maxBufferedDocs, MergeFactor, etc, all of which are guesses
> > at best.
>
> I agree this is the most efficient way to flush. The one caveat is
> this Jira issue:
>
> http://issues.apache.org/jira/browse/LUCENE-845
>
> which can cause over-merging if you make maxBufferedDocs too large.
>
> I think the rule of thumb to avoid this issue is 1) set
> maxBufferedDocs to be no more than 10X the "typical" number of docs
> you will flush, and then 2) flush by RAM usage.
>
> So for example if when you flush by RAM you typically flush "around"
> 200-300 docs, then setting maxBufferedDocs to eg 1000 is good since
> it's far above 200-300 (so it won't trigger a flush when you didn't
> want it to) but it's also well below 10X your range of docs (so it
> won't tickle the above bug).
>
> Mike
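Mike's rule of thumb above boils down to a bit of arithmetic, sketched here (the helper name is made up; maxBufferedDocs itself is the real IndexWriter setting being chosen): pick a value comfortably above the typical RAM-flush size, but no more than 10X it, so it neither triggers flushes you didn't want nor tickles LUCENE-845.

```java
// Illustrative helper, not a Lucene API: choose a maxBufferedDocs value
// from the "typical" number of docs a RAM-based flush writes.
public static int pickMaxBufferedDocs(int typicalDocsPerFlush) {
    // ~4X the typical flush count keeps us well above the flush size
    // (no accidental early flushes) and well below the 10X ceiling
    // that can cause over-merging (LUCENE-845).
    // Mike's example: ~250 docs per flush -> 1000.
    return 4 * typicalDocsPerFlush;
}
```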