Message-ID: <4200678.1155331455633.JavaMail.jira@brutus>
Date: Fri, 11 Aug 2006 14:24:15 -0700 (PDT)
From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields
In-Reply-To: <7189075.1155222434115.JavaMail.jira@brutus>

[ http://issues.apache.org/jira/browse/LUCENE-648?page=comments#action_12427630 ]

Michael McCandless commented on LUCENE-648:
-------------------------------------------

OK I ran some
basic benchmarks to test the effect on indexing of varying the ZIP compression level from 0-9. Lucene currently hardwires the compression level at 9 (= BEST).

I found a decent text corpus here:

    http://people.csail.mit.edu/koehn/publications/europarl

I ran all tests on the "Portuguese-English" data set, which is a total of 327.5 MB of plain text across 976 files.

I just ran the demo IndexFiles, modified to add the file contents as only a compressed stored field (i.e. not indexed). Note that this "amplifies" the cost of compression because in a real setting there would also be a number of indexed fields. I didn't change any of the default merge factor settings.

I'm running on Ubuntu Linux 6.06, on a single-CPU (2.4 GHz Pentium 4) desktop machine with the index stored on an internal ATA hard drive.

I first tested indexing time with and without the patch from LUCENE-629 here:

    old version:      648.7 sec
    patched version:  145.5 sec

We clearly need to get that patch committed & released! Compressed fields are far more costly than they ought to be, and people are now using this (as of the 1.9 release).

So, then I ran all subsequent tests with the above patch applied. All numbers are the avg. of 3 runs:

    Level   Index time (sec)   Index size (MB)
    None          65.3              322.3
    0             92.3              322.3
    1             80.8              128.8
    2             80.6              122.2
    3             81.3              115.8
    4             89.8              111.3
    5            104.0              106.2
    6            121.8              103.6
    7            131.7              103.1
    8            144.8              102.9
    9            145.5              102.9

Quick conclusions:

  * Both index time and index size do indeed vary substantially when you change the compression level.

  * The "sweet spot" above seems to be around 4 or 5 -- should we change the default from 9?

  * I would still say we should make it possible for Lucene users to change the compression level.

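For anyone who wants a quick feel for the time/size trade-off without building an index, here is a minimal sketch that runs java.util.zip.Deflater over an in-memory buffer at each level from 0 to 9. It is not the benchmark used above: the class name DeflateLevelDemo, the helper methods, and the synthetic sample text are all assumptions made for illustration, and the absolute numbers will differ from the table (different data, no indexing, no disk I/O).

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical demo class; not part of Lucene.
public class DeflateLevelDemo {

    /** Compresses input at the given level (0-9) and returns the compressed byte count. */
    public static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[8192];
            int total = 0;
            // Drain the compressor; only the output length is of interest here.
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    /** Builds repetitive sample text; a stand-in for a real corpus. */
    public static byte[] sampleText(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("The quick brown fox jumps over the lazy dog, line ")
              .append(i)
              .append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = sampleText(200_000);
        System.out.println("Level  Time (ms)  Size (bytes)");
        for (int level = Deflater.NO_COMPRESSION; level <= Deflater.BEST_COMPRESSION; level++) {
            long start = System.nanoTime();
            int size = compressedSize(input, level);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%5d  %9d  %12d%n", level, elapsedMs, size);
        }
    }
}
```

A single pass like this is noisy (JIT warm-up, GC), so treat the timings as rough; averaging several runs, as in the table above, gives steadier numbers.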
> Allow changing of ZIP compression level for compressed fields
> -------------------------------------------------------------
>
>                 Key: LUCENE-648
>                 URL: http://issues.apache.org/jira/browse/LUCENE-648
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.0.1
>            Reporter: Michael McCandless
>            Priority: Minor
>
> In response to this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-user/38810
> I think we should allow changing the compression level used in the call to java.util.zip.Deflater in FieldsWriter.java. Right now it's hardwired to "best":
>   compressor.setLevel(Deflater.BEST_COMPRESSION);
> Unfortunately, this can apparently cause the zip library to take a very long time (10 minutes for 4.5 MB in the above thread) and so people may want to change this setting.
> One approach would be to read the default from a Java system property, but it seems that recently (pre 2.0 I think) there was an effort to not rely on Java system properties (many were removed).
> A second approach would be to add static methods (and a static class attr) to globally set the compression level?
> A third method would be in the document.Field class, e.g. a setCompressLevel/getCompressLevel? But then every time a document is created with this field you'd have to call setCompressLevel, since Lucene doesn't have a global Field schema (like Solr).
> Any other ideas / preferences for any of these methods?
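To make the second approach from the quoted issue concrete (static methods plus a static class attribute), here is a rough sketch. Everything in it is hypothetical: CompressionLevelConfig, setCompressionLevel, getCompressionLevel and compress are names invented for illustration, not existing Lucene API, and the real change would live in or near FieldsWriter.java. The default stays at Deflater.BEST_COMPRESSION to match the currently hardwired behaviour.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical holder for a global compression level; not an actual Lucene class.
public class CompressionLevelConfig {

    // Defaults to the level that is hardwired today.
    private static volatile int level = Deflater.BEST_COMPRESSION;

    /** Sets the global compression level; accepts 0 (none) through 9 (best). */
    public static void setCompressionLevel(int newLevel) {
        if (newLevel < Deflater.NO_COMPRESSION || newLevel > Deflater.BEST_COMPRESSION) {
            throw new IllegalArgumentException("level must be in 0..9, got " + newLevel);
        }
        level = newLevel;
    }

    public static int getCompressionLevel() {
        return level;
    }

    /** Compresses input using whatever level is currently configured. */
    public static byte[] compress(byte[] input) {
        Deflater compressor = new Deflater();
        try {
            compressor.setLevel(level); // was: Deflater.BEST_COMPRESSION
            compressor.setInput(input);
            compressor.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
            byte[] buffer = new byte[1024];
            while (!compressor.finished()) {
                int count = compressor.deflate(buffer);
                out.write(buffer, 0, count);
            }
            return out.toByteArray();
        } finally {
            compressor.end(); // release native zlib resources
        }
    }
}
```

A caller would do something like CompressionLevelConfig.setCompressionLevel(5) once at startup. The obvious drawback of a static setting is that it applies to every index and every field in the JVM, which is the trade-off against the per-Field approach described above.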