avro-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-135) add compression to data files
Date Mon, 11 Jan 2010 23:54:54 GMT

    [ https://issues.apache.org/jira/browse/AVRO-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798941#action_12798941 ]

Scott Carey commented on AVRO-135:

Let's just put deflate or gzip in here for this release.  This is the least amount of work,
and so long as we make it equivalent to 'gzip 1' it isn't that slow.  'gzip 1' is about 4x faster
at compressing than the normal default of 'gzip 6'; on today's CPUs, 'gzip 1' typically gets
between 35MB/sec and 70MB/sec compression throughput.

I have some code I can contribute for Java that replaces GZIPOutputStream to allow control
over the compression level and exposes the number of bytes written (compressed and uncompressed).
This can use the Hadoop-optimized PureJavaCrc32 for best performance as well.
At this point, I am eager to use 1.3 and will write such a codec anyway if it is not supported
(not sure if gzip or deflate, but that is a minor issue on ~64k blocks).
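
For illustration only, here is a minimal sketch of what per-block compression at a
'gzip 1' equivalent level could look like using nothing but java.util.zip. This is not
the code offered above and not Avro's codec API: the class name BlockDeflateSketch, the
method names, and the choice of zlib-wrapped deflate are all assumptions made for the
example.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Avro: deflate one block at a chosen level.
class BlockDeflateSketch {

    // Compress one block. Level 1 (Deflater.BEST_SPEED) corresponds to 'gzip 1'.
    static byte[] compress(byte[] block, int level) {
        Deflater deflater = new Deflater(level); // zlib-wrapped deflate (an assumption)
        try {
            deflater.setInput(block);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    // Decompress one block whose uncompressed length is known to the caller.
    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int off = 0;
            while (off < out.length && !inflater.finished()) {
                int n = inflater.inflate(out, off, out.length - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // no progress possible: truncated or unsupported input
                }
                off += n;
            }
            if (off != originalLength) {
                throw new IllegalArgumentException("block shorter than expected");
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt block", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // A ~64k block of repetitive sample data, matching the block size discussed above.
        byte[] block = new byte[64 * 1024];
        for (int i = 0; i < block.length; i++) {
            block[i] = (byte) ("avro".charAt(i % 4) + (i / 1024));
        }
        byte[] fast = compress(block, Deflater.BEST_SPEED); // level 1
        byte[] dflt = compress(block, 6);                   // the usual gzip default level
        System.out.println("uncompressed bytes: " + block.length);
        System.out.println("level 1 bytes:      " + fast.length);
        System.out.println("level 6 bytes:      " + dflt.length);
        System.out.println("round trip ok:      "
                + Arrays.equals(block, decompress(fast, block.length)));
    }
}
```

Because each block is compressed in one call, the caller sees both byte counts directly:
block.length is the uncompressed size and the returned array's length is the compressed
size.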

LZF/LZO/FastLZ would be nice, but that is more involved.
I've been working on some pure Java LZF implementations as an experiment.  I chose this over
FastLZ because the code was a lot easier to understand and much better documented (though
both are lacking).  Additionally, FastLZ's site warns that the format may change at any
time, which also kept me away.
Short story -- the JIT isn't good enough in Java 6 or Java 7 to do the right low-level optimizations
to catch up to native code yet, but I can get compression throughput of about 80 to 120MB/sec and
decompression throughput between 100 and 160MB/sec with it, with compression ratios just slightly
worse than LZO but better than the C LZF code.

If FastLZ's Java library is used, it could use some performance improvements.

> add compression to data files
> -----------------------------
>                 Key: AVRO-135
>                 URL: https://issues.apache.org/jira/browse/AVRO-135
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Priority: Blocker
>             Fix For: 1.3.0
> We should add support for at least one compression codec to data files.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
