lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields
Date Fri, 11 Aug 2006 21:24:15 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-648?page=comments#action_12427630 ] 
            
Michael McCandless commented on LUCENE-648:
-------------------------------------------


OK I ran some basic benchmarks to test the effect on indexing of
varying the ZIP compression level from 0-9.

Lucene currently hardwires compression level at 9 (= BEST).

I found a decent text corpus here:

     http://people.csail.mit.edu/koehn/publications/europarl

I ran all tests on the "Portuguese-English" data set, which totals
327.5 MB of plain text across 976 files.

I just ran the demo IndexFiles, modified to add the file contents as
only a compressed stored field (i.e. not indexed).  Note that this
"amplifies" the cost of compression because in a real setting there
would also be a number of indexed fields.

I didn't change any of the default merge factor settings.  I'm running
on Ubuntu Linux 6.06, on a single-CPU (2.4 GHz Pentium 4) desktop machine
with the index stored on an internal ATA hard drive.

I first tested indexing time with and without the patch from
LUCENE-629 here:

      old version: 648.7 sec

  patched version: 145.5 sec

We clearly need to get that patch committed & released!  Compressed
fields are far more costly than they ought to be, and people are now
using this (as of 1.9 release).

So, then I ran all subsequent tests with the above patch applied.  All
numbers are avg. of 3 runs:

  Level  Index time (sec)  Index size (MB)

   None              65.3            322.3          
      0              92.3            322.3
      1              80.8            128.8
      2              80.6            122.2
      3              81.3            115.8
      4              89.8            111.3
      5             104.0            106.2
      6             121.8            103.6
      7             131.7            103.1
      8             144.8            102.9
      9             145.5            102.9

Quick conclusions:

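  * Indexing time does vary substantially with the compression level:
    from about 80 sec at levels 1-3 up to 145.5 sec at level 9, while
    index size only drops from 128.8 MB to 102.9 MB over that range.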

  * The "sweet spot" above seems to be around 4 or 5 -- should we
    change the default from 9?

  * Either way, I would still say we should make it possible for
    Lucene users to change the compression level.
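
For anyone who wants to see the level/size/time tradeoff without building
an index, here is a minimal sketch that calls java.util.zip.Deflater
directly at each level.  It is not the benchmark I ran above (that went
through the IndexFiles demo); the class name, the helper method and the
synthetic stand-in text are all made up for illustration, and it assumes
a current JDK.  Timings from a single pass like this are only indicative.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical standalone sketch; not part of Lucene.
final class DeflateLevelSketch {

    /** Compresses data at the given level (0-9) and returns the compressed size in bytes. */
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[8192];
            int total = 0;
            // Drain the compressor; only the byte count is kept, not the output.
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    public static void main(String[] args) {
        // Synthetic stand-in text (an assumption); substitute a real corpus file.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("resumption of the session ").append(i % 97).append('\n');
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        for (int level = 0; level <= 9; level++) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n",
                    level, data.length, size, micros);
        }
    }
}
```

Level 0 stores the data uncompressed and adds a small framing overhead,
so its output comes out slightly larger than the input.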


> Allow changing of ZIP compression level for compressed fields
> -------------------------------------------------------------
>
>                 Key: LUCENE-648
>                 URL: http://issues.apache.org/jira/browse/LUCENE-648
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.0.1
>            Reporter: Michael McCandless
>            Priority: Minor
>
> In response to this thread:
>       http://www.gossamer-threads.com/lists/lucene/java-user/38810
> I think we should allow changing the compression level used in the call to java.util.zip.Deflater in FieldsWriter.java.  Right now it's hardwired to "best":
>       compressor.setLevel(Deflater.BEST_COMPRESSION);
> Unfortunately, this can apparently cause the zip library to take a very long time (10 minutes for 4.5 MB in the above thread) and so people may want to change this setting.
> One approach would be to read the default from a Java system property, but it seems that recently (pre 2.0 I think) there was an effort to not rely on Java system properties (many were removed).
> A second approach would be to add static methods (and a static class attr) to globally set the compression level?
> A third method would be in the document.Field class, e.g. a setCompressLevel/getCompressLevel?  But then every time a document is created with this field you'd have to call setCompressLevel, since Lucene doesn't have a global Field schema (like Solr).
> Any other ideas / preferences for any of these methods?
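
To make the second approach above concrete, here is a rough sketch of
what a global static setting could look like.  This is only an
illustration under assumed names: CompressionLevelSketch,
setCompressionLevel, getCompressionLevel and newCompressor are all
hypothetical, and none of this is existing Lucene API.  The default stays
at Deflater.BEST_COMPRESSION to match the currently hardwired behavior.

```java
import java.util.zip.Deflater;

// Hypothetical holder for a process-wide compression level; not Lucene API.
final class CompressionLevelSketch {

    // Defaults to the currently hardwired level (9 = BEST_COMPRESSION).
    private static volatile int compressionLevel = Deflater.BEST_COMPRESSION;

    private CompressionLevelSketch() {}

    /** Sets the global level; accepts 0 (no compression) through 9 (best). */
    static void setCompressionLevel(int level) {
        if (level < Deflater.NO_COMPRESSION || level > Deflater.BEST_COMPRESSION) {
            throw new IllegalArgumentException("compression level must be 0-9, got " + level);
        }
        compressionLevel = level;
    }

    static int getCompressionLevel() {
        return compressionLevel;
    }

    /** Creates a compressor configured with the current global level. */
    static Deflater newCompressor() {
        Deflater compressor = new Deflater();
        compressor.setLevel(getCompressionLevel());
        return compressor;
    }
}
```

A caller would set the level once at startup, for example
CompressionLevelSketch.setCompressionLevel(4), and the writer would then
obtain its Deflater from newCompressor() instead of hardwiring the level.
The obvious downside is that the setting is global to the JVM, so two
indexes in one process cannot use different levels.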

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

