avro-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-135) add compression to data files
Date Tue, 19 Jan 2010 00:56:54 GMT

    [ https://issues.apache.org/jira/browse/AVRO-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802009#action_12802009 ]

Scott Carey commented on AVRO-135:
----------------------------------

bq. in the spec, we should be very clear on whether we're using gzip or deflate, as these
are often confused. my slight preference would be for deflate, since it's more minimal, but
i also realize that adding another level of CRC will give lots of folks warm fuzzy feelings.

I lean towards deflate, though we have to be very clear about what we mean by that, because
'deflate' is interpreted in two ways.
The first is the raw compressed stream, which has no checksum or header (RFC 1951).  In Java,
this is the 'unwrapped' deflate variant (the nowrap option of Deflater).  Some web browsers
call this "deflate".
The second is a deflate stream wrapped in a 2 byte header and a trailing adler32 checksum
(RFC 1950).  This is also known as the "ZLIB" format, but some web browsers simply call it
'deflate'.  It has a 6 byte overhead (2 byte header, 4 byte adler32 checksum).  (A *.zip file,
by contrast, stores its individual entries as raw deflate and keeps a CRC-32 in the entry
metadata.)

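Last is gzip (RFC 1952), which wraps raw deflate with a header and footer that add at least
18 bytes of overhead (a 10 byte header plus an 8 byte trailer holding a CRC-32 and the
uncompressed length), so typically about 20 bytes.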
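
To make the difference between the three formats concrete, here is a minimal Java sketch
that compresses the same bytes as raw deflate, ZLIB and gzip and prints the resulting sizes.
The class and method names (DeflateFormats, deflate, gzip) are made up for illustration and
are not part of Avro or of the attached patch; only the java.util.zip calls are real.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class DeflateFormats {

    /** Raw deflate (RFC 1951) when nowrap is true, ZLIB (RFC 1950) when it is false. */
    static byte[] deflate(byte[] input, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the stream finishes the deflate stream before the finally block runs.
        try (OutputStream out = new DeflaterOutputStream(sink, deflater)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deflater.end(); // a caller-supplied Deflater is not released by close()
        }
        return sink.toByteArray();
    }

    /** gzip (RFC 1952): raw deflate plus a header and a CRC-32/length trailer. */
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("add compression to data files ");
        }
        byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println("input: " + input.length + " bytes");
        System.out.println("raw:   " + deflate(input, true).length + " bytes");
        System.out.println("zlib:  " + deflate(input, false).length + " bytes");
        System.out.println("gzip:  " + gzip(input).length + " bytes");
    }
}
```

For small blocks the fixed wrapper overhead is a visible fraction of the output, which is
one reason the choice matters for a block-oriented file format.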

In the past, I've leaned towards gzip because a file written in that format can be read by
all sorts of utilities.  But we are storing compressed blocks within our own file format,
so there is no advantage to using gzip.  Furthermore, the Java API for gzip annoyingly removes
the ability to set the compression level and to find out the number of bytes output.  I think
that control over the compression level is highly important for users.
The Deflater API in Java does allow control over the compression level.
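
As a rough illustration of that control, the sketch below uses java.util.zip.Deflater
directly to pick a compression level and to read back the number of bytes written.  The
class and method names (LevelDemo, rawDeflate, countCompressedBytes) are hypothetical and
only serve the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper class, for illustration only.
final class LevelDemo {

    /** Compresses input as raw deflate at the given level (0 to 9, or -1 for the default). */
    static byte[] rawDeflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level, true); // true = no ZLIB header or checksum
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                sink.write(buf, 0, n);
            }
            return sink.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Returns the compressed size as reported by the Deflater itself. */
    static long countCompressedBytes(byte[] input, int level) {
        Deflater deflater = new Deflater(level, true);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] scratch = new byte[4096];
            while (!deflater.finished()) {
                deflater.deflate(scratch); // output is discarded, only counted
            }
            return deflater.getBytesWritten();
        } finally {
            deflater.end();
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("add compression to data files ");
        }
        byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);

        int[] levels = {Deflater.NO_COMPRESSION, Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            System.out.println("level " + level + ": "
                    + countCompressedBytes(input, level) + " bytes");
        }
    }
}
```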

The 'ZLIB' deflate format has an adler32 checksum and 2 byte header (and is standardized),
so if we want a checksum we can choose that instead of gzip.
Otherwise, the raw deflate stream, perhaps with the uncompressed size prepended, would be
great.
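
A sketch of what "raw deflate with the uncompressed size prepended" could look like in Java
follows.  The framing here (a 4 byte big-endian length in front of the raw deflate bytes) is
purely an assumption for illustration; it is not a format defined by the spec or the patch,
and the class and method names (SizedDeflateBlock, encode, decode) are made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical block framing, for illustration only:
// [4 byte big-endian uncompressed size][raw deflate stream]
final class SizedDeflateBlock {

    static byte[] encode(byte[] input, int level) {
        Deflater deflater = new Deflater(level, true); // raw deflate, RFC 1951
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            sink.write(ByteBuffer.allocate(4).putInt(input.length).array(), 0, 4);
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                sink.write(buf, 0, n);
            }
            return sink.toByteArray();
        } finally {
            deflater.end();
        }
    }

    static byte[] decode(byte[] block) {
        int size = ByteBuffer.wrap(block).getInt();
        if (size < 0) {
            throw new IllegalArgumentException("negative size prefix: " + size);
        }
        // Knowing the size up front lets us allocate the output buffer exactly once.
        byte[] out = new byte[size];
        Inflater inflater = new Inflater(true);
        try {
            inflater.setInput(block, 4, block.length - 4);
            int off = 0;
            while (off < size) {
                int n = inflater.inflate(out, off, size - off);
                if (n == 0 && (inflater.finished() || inflater.needsInput()
                        || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("compressed data shorter than size prefix");
                }
                off += n;
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt deflate data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] input = "add compression to data files".getBytes(StandardCharsets.UTF_8);
        byte[] block = encode(input, Deflater.BEST_SPEED);
        String roundTrip = new String(decode(block), StandardCharsets.UTF_8);
        System.out.println(block.length + " byte block, round trip: " + roundTrip);
    }
}
```

Note that this variant carries no checksum of its own, so any integrity check would have to
come from elsewhere in the file format.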

> add compression to data files
> -----------------------------
>
>                 Key: AVRO-135
>                 URL: https://issues.apache.org/jira/browse/AVRO-135
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Philip Zeyliger
>            Priority: Blocker
>             Fix For: 1.3.0
>
>         Attachments: AVRO-135.patch.txt
>
>
> We should add support for at least one compression codec to data files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

