Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 24764 invoked from network); 12 Apr 2009 03:20:38 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 12 Apr 2009 03:20:38 -0000 Received: (qmail 6136 invoked by uid 500); 12 Apr 2009 03:20:37 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 5998 invoked by uid 500); 12 Apr 2009 03:20:36 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 5990 invoked by uid 99); 12 Apr 2009 03:20:36 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 12 Apr 2009 03:20:36 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 12 Apr 2009 03:20:35 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id D25F5234C003 for ; Sat, 11 Apr 2009 20:20:14 -0700 (PDT) Message-ID: <1476656069.1239506414847.JavaMail.jira@brutus> Date: Sat, 11 Apr 2009 20:20:14 -0700 (PDT) From: "Shai Erera (JIRA)" To: java-dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark In-Reply-To: <644592694.1239212773070.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698161#action_12698161 ] Shai Erera commented on LUCENE-1591: ------------------------------------ Before I post a patch I wanted to test reading the 20090306 enwiki dump and write it as a one line document, all using the bz2 in/out streams. After 9 hours and 2881000 documents (!!!), I've hit the following exception: {code} Exception in thread "Thread-1" java.lang.ArrayIndexOutOfBoundsException at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source) at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source) at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:76) at java.lang.Thread.run(Thread.java:810) {code} Same exception like Mike hit, only from a different method. I'm using the latest xerces jar Mike put. I'm beginning to think this enwiki dump is jinxed :) Anyway, I'll post the patch shortly and run on the 20070527 version to verify. > Enable bzip compression in benchmark > ------------------------------------ > > Key: LUCENE-1591 > URL: https://issues.apache.org/jira/browse/LUCENE-1591 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark > Reporter: Shai Erera > Fix For: 2.9 > > > bzip compression can aid the benchmark package by not requiring extracting bzip files (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false and in the relevant tasks either decompress the input file or compress the output file using the bzip streams. > It will add a dependency on ant.jar which contains two classes similar to GZIPOutputStream and GZIPInputStream which compress/decompress files using the bzip algorithm. > bzip is known to be superior in its compression performance to the gzip algorithm (~20% better compression), although it does the compression/decompression a bit slower. > I wil post a patch which adds this parameter and implement it in LineDocMaker, EnwikiDocMaker and WriteLineDoc task. Maybe even add the capability to DocMaker or some of the super classes, so it can be inherited by all sub-classes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org