From: "Felipe Albrecht"
To: java-dev@lucene.apache.org
Date: Sat, 16 Feb 2008 02:49:21 -0200
Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
In-Reply-To: <0674058A-9DCD-4E7D-9AC4-F33F5C5A4E05@mikemccandless.com>

Hello,

I have a simple question about this patch. The following patch segment
shows that the threshold for synchronizing the data was changed:

     if (ramBufferSize != IndexWriter.DISABLE_AUTO_FLUSH
-        && numBytesUsed > 0.95 * ramBufferSize)
+        && numBytesUsed >= ramBufferSize)
       balanceRAM();

Why was it changed, and *might* it be influencing some of the timing
results? In other words, it says "use more RAM before flushing", and
doing larger but fewer flushes may be influencing the final time.

I am a bit new to Lucene, only 2 weeks, but this caught my attention.

Thank you,
Felipe Albrecht

On Feb 11, 2008 5:30 PM, Michael McCandless wrote:
>
>
> Grant Ingersoll wrote:
>
> > Also, perhaps we should spin off another thread to discuss how to
> > make DocsWriter easier to maintain.
> > My biggest concern is
> > understanding how the various threads work together, and a few
> > other areas but, like I said, let's spin up a separate thread to
> > brainstorm what is needed.
>
> I agree we should work on simplifying it with time, and, spreading
> the knowledge of how it works.
>
>
> > Note, that there is some risk in just using wikipedia for profiling
> > given it's distribution of terms, etc..
>
> Good point. Previously I was using Europarl, but, that corpus is
> just too fast to index.
>
> Are you thinking Wikipedia is somewhat "dirty" (lots of extra terms
> not normally seen with clean content)? Since I'm using
> StandardAnalyzer and not an analyzer based on the new
> WikipediaTokenizer, I'm getting even extra terms. Also, I think we'd
> need an HTMLFilter in the chain since Wikipedia content uses HTML
> markup. Grant, what analyzer chain do you use when you index Wikipedia?
>
>
> > I also wonder if using the LineDocMaker is all that realistic a
> > profiling scenario. While it is really useful in that it minimizes
> > IO interaction, etc. I can't help but feel that it isn't at all
> > close to typical usage. Most users are not going to have all their
> > docs rolled up into a single file, 1 doc per line, so I wonder if
> > we potentially lose insight into how Lucene performs given that
> > other issues like I/O/memory used for loading files may force the
> > JVM/Lucene to not have the resources it needs. Of course, I do
> > know it is good to try to isolate things so we can focus just on
> > Lucene, but we also should try to make some accounting for how it
> > lives in the wild.
>
> I agree, this part is not realistic, and the intention is to measure
> just the indexing time. In fact I expect most apps spend quite a bit
> more time building up a Document (filtering binary docs, etc) than
> actually indexing it. The only real-world app that I can think of
> that would be close to LineDocMaker is using Lucene to search big log
> files, where one line = one Document.
>
>
> > Last, I think it would be good to always attach/check in the .alg
> > file that is used when running the test, so that others can verify
> > on different systems/configurations, etc.
>
> I did post the alg (under LUCENE-1172). Though I see I forgot to
> {code} it and it looks messed up now. My recent test to try a single
> quickSort(Object[]) were the same alg, just repeated 10 times instead
> of 3.
>
> But I agree we should always post the alg for all tests...
>
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
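
To make the threshold question at the top of this message concrete, here is a minimal sketch of the two conditions from the patch segment, evaluated side by side. It is not the DocumentsWriter code itself: the class name, the method names, the 16 MB buffer size, and the local DISABLE_AUTO_FLUSH constant (a stand-in for IndexWriter.DISABLE_AUTO_FLUSH, with an assumed value) are all hypothetical and chosen only for illustration.

```java
public class FlushThresholdSketch {
    // Hypothetical stand-in for IndexWriter.DISABLE_AUTO_FLUSH; the value is an assumption.
    static final long DISABLE_AUTO_FLUSH = -1;

    // Condition as it read before the patch: true once usage exceeds 95% of the buffer.
    static boolean triggersBefore(long numBytesUsed, long ramBufferSize) {
        return ramBufferSize != DISABLE_AUTO_FLUSH
            && numBytesUsed > 0.95 * ramBufferSize;
    }

    // Condition as it reads after the patch: true only once usage reaches the full buffer.
    static boolean triggersAfter(long numBytesUsed, long ramBufferSize) {
        return ramBufferSize != DISABLE_AUTO_FLUSH
            && numBytesUsed >= ramBufferSize;
    }

    public static void main(String[] args) {
        long ramBufferSize = 16L * 1024 * 1024;          // 16 MB, an illustrative value
        long used = (long) (0.97 * ramBufferSize);       // usage between 95% and 100%

        System.out.println("before patch: " + triggersBefore(used, ramBufferSize));
        System.out.println("after patch:  " + triggersAfter(used, ramBufferSize));
    }
}
```

At 97% usage the old condition is true and the new one is false, so under the new condition the check lets the buffer fill the remaining 5% before balanceRAM() is called. Whether that difference affects the reported timings is exactly the open question in the message; the sketch does not answer it.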