Return-Path:
Delivered-To: apmail-lucene-java-dev-archive@www.apache.org
Received: (qmail 21495 invoked from network); 8 Apr 2009 15:25:37 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 8 Apr 2009 15:25:37 -0000
Received: (qmail 71116 invoked by uid 500); 8 Apr 2009 15:25:36 -0000
Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org
Received: (qmail 71050 invoked by uid 500); 8 Apr 2009 15:25:36 -0000
Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: java-dev@lucene.apache.org
Delivered-To: mailing list java-dev@lucene.apache.org
Received: (qmail 71042 invoked by uid 99); 8 Apr 2009 15:25:36 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 08 Apr 2009 15:25:36 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 08 Apr 2009 15:25:34 +0000
Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 0537E234C004 for ; Wed, 8 Apr 2009 08:25:13 -0700 (PDT)
Message-ID: <1564157252.1239204312949.JavaMail.jira@brutus>
Date: Wed, 8 Apr 2009 08:25:12 -0700 (PDT)
From: "Shai Erera (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1539) Improve Benchmark
In-Reply-To: <1346552103.1234291740340.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
X-Virus-Checked: Checked by ClamAV on apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697057#action_12697057 ]

Shai Erera commented on LUCENE-1539:
------------------------------------

Would it also be interesting to add extensions to EnwikiDocMaker, WriteLineDoc and LineDocMaker that can read/write the content in bzip format?

I downloaded the latest Enwiki dump, which is 4.5 GB in bzip format. The extracted XML is 17 GB. I thought to myself that I have no real reason to extract it, since I can read the content directly from the bzip file. So I looked around and found that ant.jar contains two classes which can read/write that format.

Just to compare, I gzipped the XML file and the result was a 5.1 GB file (~13% larger than the bzip file). Measurements published on the web also show that bzip compresses better than gzip, although it probably runs a bit slower. I then ran the WriteLineDoc task to produce the one-line-per-document text file, and stopped it when the file reached 228 MB. Again I zipped, gzipped and bzipped the file, and the bzip result was smaller by ~20%.

So I was wondering: apart from the speed of reading from a compressed archive, which is slower than reading from a plain XML or TXT file, is there a reason why we don't use bzip/gzip when reading content? It would save a lot of space, and I'm not sure that this part of indexing is the one that matters most.

However, I'm aware that some people might prefer to read from plain files, so I suggest we just add extensions which can read/write the compressed format. The question is, assuming you agree, should we use bzip, which requires an external library, or gzip, which is in the JDK and does not compress as well as bzip but might perform better? I can take some measurements if needed, but my main question is whether we want to introduce a dependency on another library.

If this belongs in a separate issue, let me know.
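To make the gzip option concrete, here is a minimal sketch that writes a one-document-per-line file through the JDK's GZIPOutputStream and reads it back through GZIPInputStream. The class name GzipLineDocs, its methods and the sample lines are hypothetical and are not part of contrib/benchmark; the tab-separated fields are only illustrative. I believe the bzip classes in ant.jar would wrap the file stream in the same place, but they are left out here because they need the extra jar.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not an existing benchmark class.
public class GzipLineDocs {

    // Writes one document per line into a gzip-compressed file.
    public static void writeLines(Path file, List<String> docs) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8))) {
            for (String doc : docs) {
                out.write(doc);
                out.newLine();
            }
        }
    }

    // Reads the documents back, decompressing on the fly, one per line.
    public static List<String> readLines(Path file) throws IOException {
        List<String> docs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                docs.add(line);
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("linedocs", ".txt.gz");
        // Illustrative lines only; real content would come from the Enwiki dump.
        List<String> docs = Arrays.asList(
                "Title one\t08-APR-2009\tbody text of the first document",
                "Title two\t08-APR-2009\tbody text of the second document");
        writeLines(file, docs);
        List<String> back = readLines(file);
        System.out.println("documents read: " + back.size());
        System.out.println("compressed bytes: " + Files.size(file));
        Files.delete(file);
    }
}
```

A doc maker that reads line by line would not need to change beyond how the underlying stream is opened, which is why an extension (or a check on the file suffix) seems enough.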
> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org