Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <f6b04db30802152049v6b285380gda052f3026d88daf@mail.gmail.com>
Date: Sat, 16 Feb 2008 02:49:21 -0200
From: "Felipe Albrecht" <felipe.albrecht@gmail.com>
To: java-dev@lucene.apache.org
Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
In-Reply-To: <0674058A-9DCD-4E7D-9AC4-F33F5C5A4E05@mikemccandless.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <167357.79602.qm@web23006.mail.ird.yahoo.com>
	 <4CABF542-7CE8-4138-A54F-8E9E098E7FD1@mikemccandless.com>
	 <1CA1FB05-F1E1-4868-9C14-D528F683946E@apache.org>
	 <0674058A-9DCD-4E7D-9AC4-F33F5C5A4E05@mikemccandless.com>

Hello,

I have a simple question about this patch.

In the following patch segment, the threshold that triggers
balanceRAM() is changed:

 if (ramBufferSize != IndexWriter.DISABLE_AUTO_FLUSH
- && numBytesUsed > 0.95 * ramBufferSize)
+ && numBytesUsed >= ramBufferSize)
 balanceRAM();

Why was it changed, and might this change by itself be influencing the
timing results? In other words, it says "use more RAM before flushing",
and doing larger but fewer flushes may be influencing the final time.
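
To make the question concrete, here is a minimal sketch of the two
conditions side by side. It is not Lucene code: the class and method
names are hypothetical, the DISABLE_AUTO_FLUSH constant is a stand-in
for IndexWriter.DISABLE_AUTO_FLUSH with an assumed value, and the 16 MB
buffer size is only an example.

```java
// Hypothetical sketch, not Lucene code: compares the old and new trigger conditions.
class ThresholdSketch {
    // Stand-in for IndexWriter.DISABLE_AUTO_FLUSH; the value is assumed for illustration.
    static final long DISABLE_AUTO_FLUSH = -1;

    // Old condition: fires once usage exceeds 95% of the RAM buffer.
    static boolean oldCheck(long numBytesUsed, long ramBufferSize) {
        return ramBufferSize != DISABLE_AUTO_FLUSH
            && numBytesUsed > 0.95 * ramBufferSize;
    }

    // New condition: fires only once usage reaches the full RAM buffer size.
    static boolean newCheck(long numBytesUsed, long ramBufferSize) {
        return ramBufferSize != DISABLE_AUTO_FLUSH
            && numBytesUsed >= ramBufferSize;
    }

    public static void main(String[] args) {
        long ramBufferSize = 16L * 1024 * 1024;            // 16 MB, assumed example value
        long used = (long) (0.97 * ramBufferSize);         // buffer is 97% full
        System.out.println("old: " + oldCheck(used, ramBufferSize));
        System.out.println("new: " + newCheck(used, ramBufferSize));
    }
}
```

With the buffer 97% full, the old condition already fires while the new
one waits until the buffer is completely used, which is why I suspect
the change alone could shift the measured time.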

I am fairly new to Lucene, only 2 weeks, but this caught my attention.

Thank you,

Felipe Albrecht


On Feb 11, 2008 5:30 PM, Michael McCandless <lucene@mikemccandless.com> wrote:
>
>
> Grant Ingersoll wrote:
>
> > Also, perhaps we should spin off another thread to discuss how to
> > make DocsWriter easier to maintain.  My biggest concern is
> > understanding how the various threads work together, and a few
> > other areas but, like I said, let's spin up a separate thread to
> > brainstorm what is needed.
>
> I agree we should work on simplifying it with time, and, spreading
> the knowledge of how it works.
>
>
> > Note that there is some risk in just using Wikipedia for profiling
> > given its distribution of terms, etc.
>
> Good point.  Previously I was using Europarl, but, that corpus is
> just too fast to index.
>
> Are you thinking Wikipedia is somewhat "dirty" (lots of extra terms
> not normally seen with clean content)?  Since I'm using
> StandardAnalyzer and not an analyzer based on the new
> WikipediaTokenizer, I'm getting even extra terms.  Also, I think we'd
> need an HTMLFilter in the chain since Wikipedia content uses HTML
> markup.  Grant, what analyzer chain do you use when you index Wikipedia?
>
>
> > I also wonder if using the LineDocMaker is all that realistic a
> > profiling scenario.  While it is really useful in that it minimizes
> > IO interaction, etc. I can't help but feel that it isn't at all
> > close to typical usage.  Most users are not going to have all their
> > docs rolled up into a single file, 1 doc per line, so I wonder if
> > we potentially lose insight into how Lucene performs given that
> > other issues like I/O/memory used for loading files may force the
> > JVM/Lucene to not have the resources it needs.  Of course, I do
> > know it is good to try to isolate things so we can focus just on
> > Lucene, but we also should try to make some accounting for how it
> > lives in the wild.
>
> I agree, this part is not realistic, and the intention is to measure
> just the indexing time.  In fact I expect most apps spend quite a bit
> more time building up a Document (filtering binary docs, etc) than
> actually indexing it.  The only real-world app that I can think of
> that would be close to LineDocMaker is using Lucene to search big log
> files, where one line = one Document.
>
>
> > Last, I think it would be good to always attach/check in the .alg
> > file that is used when running the test, so that others can verify
> > on different systems/configurations, etc.
>
> I did post the alg (under LUCENE-1172).  Though I see I forgot to
> {code} it and it looks messed up now.  My recent test to try a single
> quickSort(Object[]) used the same alg, just repeated 10 times instead
> of 3.
>
> But I agree we should always post the alg for all tests...
>
>
>
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
