From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: MergeFactor and MaxBufferedDocs value should ...?
Date: Sun, 25 Mar 2007 11:07:32 -0400

I should add that in my situation, the number of documents that fit in RAM is... er... problematic to determine. My current project is composed of books, which I chose to index one book at a time. Unfortunately, answering the question "how big is a book?" doesn't help much; they range from 2 pages to over 7,000 pages. So setting the various indexing parameters, especially maxBufferedDocs, is a hard balance between efficiency and memory. Will I happen to get a string of 100 large books? If so, I need to set the limit to a small number, which will not be terribly efficient for the "usual" case.

That said, I don't much care about efficiency in this case.
I can't generate the index quickly (20,000+ books), and the factors I've chosen let me generate it between the time I leave work and the time I get back in the morning, so I don't really need much more tweaking. But this illustrates why I referred to picking factors as a "guess". With a heterogeneous index where the documents vary widely in size, picking parameters isn't straightforward. My current parameters may not work if I index the documents in a different order than I do currently. I just don't know. They may not even work on the next set of data, since much of the data is OCR, and for many books it's pretty trashy and/or incomplete (imagine the OCR output of a genealogy book that consists entirely of a stylized tree with the names written by hand along the branches in many orientations!). We're promised much better OCR data in the next set of books we index, which may blow my current indexer out of the water.

Which is why I'm so glad that ramSizeInBytes has been added. It seems to me that I can now create a reasonably generalized way to index heterogeneous documents with "good enough" efficiency. I'm imagining keeping a few simple statistics, like the size of each incoming document and the change in index size as a result of indexing that doc. This should allow me to figure out a reasonable factor for predicting how much the *next* addition will increase the index, and to flush RAM based on that prediction. With, probably, quite a large safety margin. I don't really care about getting every last bit of efficiency in this case. What I *do* care about is that the indexing run completes, and this new capability seems to allow me to ensure that without penalizing the bulk of my indexing because of a few edge cases.

Anyway, thanks for adding this capability, which I'll probably use in the pretty near future. And thanks, Michael, for your explanation of what these factors really do. It may have been documented before, but this time it's finally sticking in my aging brain...
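The bookkeeping described above could be sketched roughly like this. Everything here is illustrative (the class and method names are made up, not part of Lucene); the idea is just to track the worst observed ratio of index-RAM growth to raw document size, and use it, with a safety margin, to predict whether the next document would blow a RAM budget:

```java
// Hypothetical helper, not a Lucene class: records how much the writer's
// RAM grew per byte of each indexed document, and predicts whether the
// next document would push past a RAM budget.
public class FlushPredictor {
    private final long ramBudgetBytes;  // RAM ceiling we want to stay under
    private final double safetyMargin;  // multiply predictions to be conservative
    private double worstRatio = 1.0;    // worst observed (RAM growth / doc bytes)

    public FlushPredictor(long ramBudgetBytes, double safetyMargin) {
        this.ramBudgetBytes = ramBudgetBytes;
        this.safetyMargin = safetyMargin;
    }

    /** Record one indexed document: its raw size and how much writer RAM grew. */
    public void record(long docBytes, long ramGrowthBytes) {
        if (docBytes > 0) {
            worstRatio = Math.max(worstRatio, (double) ramGrowthBytes / docBytes);
        }
    }

    /** Should we flush the writer's RAM before adding a doc of this size? */
    public boolean shouldFlushBefore(long nextDocBytes, long currentRamBytes) {
        double predictedGrowth = nextDocBytes * worstRatio * safetyMargin;
        return currentRamBytes + predictedGrowth > ramBudgetBytes;
    }
}
```

In use, currentRamBytes would come from IndexWriter.ramSizeInBytes() before each addDocument call, and the RAM growth from the difference in that value across the call.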
Erick

On 3/23/07, Michael McCandless wrote:
>
> "Erick Erickson" wrote:
> > I haven't used it yet, but I've seen several references to
> > IndexWriter.ramSizeInBytes() and using it to control when the writer
> > flushes the RAM. This seems like a more deterministic way of
> > making things efficient than trying various combinations of
> > maxBufferedDocs, MergeFactor, etc, all of which are guesses
> > at best.
>
> I agree this is the most efficient way to flush. The one caveat is
> this Jira issue:
>
> http://issues.apache.org/jira/browse/LUCENE-845
>
> which can cause over-merging if you make maxBufferedDocs too large.
>
> I think the rule of thumb to avoid this issue is 1) set
> maxBufferedDocs to be no more than 10X the "typical" number of docs
> you will flush, and then 2) flush by RAM usage.
>
> So for example if when you flush by RAM you typically flush "around"
> 200-300 docs, then setting maxBufferedDocs to eg 1000 is good since
> it's far above 200-300 (so it won't trigger a flush when you didn't
> want it to) but it's also well below 10X your range of docs (so it
> won't tickle the above bug).
>
> Mike
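Mike's rule of thumb above boils down to a bit of arithmetic, sketched here (the helper name is made up; maxBufferedDocs itself is the real IndexWriter setting being chosen): pick a value comfortably above the typical RAM-flush size, but no more than 10X it, so it neither triggers flushes you didn't want nor tickles LUCENE-845.

```java
// Illustrative helper, not a Lucene API: choose a maxBufferedDocs value
// from the "typical" number of docs a RAM-based flush writes.
public static int pickMaxBufferedDocs(int typicalDocsPerFlush) {
    // ~4X the typical flush count keeps us well above the flush size
    // (no accidental early flushes) and well below the 10X ceiling
    // that can cause over-merging (LUCENE-845).
    // Mike's example: ~250 docs per flush -> 1000.
    return 4 * typicalDocsPerFlush;
}
```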