hadoop-common-issues mailing list archives

From "Kevin J. Price (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6297) Hadoop's support for zlib library lacks support to perform flushes (Z_SYNC_FLUSH and Z_FULL_FLUSH)
Date Thu, 24 Feb 2011 06:15:38 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998702#comment-12998702 ]

Kevin J. Price commented on HADOOP-6297:

SequenceFile just compresses blocks of input into variable-size output blocks, which is
different from having fixed-size output blocks.

The theory is that if the compressed block size is fixed, and an even divisor of the HDFS
block size, then a naive 'split at the HDFS block boundaries' will work without having to
do any seeking around at the start of each mapper.  Theoretically you get less start-of-mapper
overhead and less reading from blocks that might not be rack local.
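
As a rough illustration of that arithmetic (a sketch only: the 64 MiB HDFS block size and
1 MiB compressed block size are assumed values, and the helper names are hypothetical, not
Hadoop APIs), every naive split offset lands on a compressed block boundary whenever the
compressed block size divides the HDFS block size:

```python
HDFS_BLOCK_SIZE = 64 * 1024 * 1024    # assumed HDFS block size (64 MiB)
COMPRESSED_BLOCK_SIZE = 1024 * 1024   # assumed fixed compressed block size (1 MiB)


def split_offsets(file_length, hdfs_block_size=HDFS_BLOCK_SIZE):
    """Byte offsets at which a naive 'one split per HDFS block' scheme starts."""
    return list(range(0, file_length, hdfs_block_size))


def starts_on_block_boundary(offset, compressed_block_size=COMPRESSED_BLOCK_SIZE):
    """True if a reader starting at this offset begins exactly at a compressed block."""
    return offset % compressed_block_size == 0


# A hypothetical file that spans five full HDFS blocks plus a partial sixth.
file_length = 5 * HDFS_BLOCK_SIZE + 12345
offsets = split_offsets(file_length)

# Because the compressed block size divides the HDFS block size evenly,
# no split needs to seek forward to find the start of a compressed block.
aligned = all(starts_on_block_boundary(o) for o in offsets)
```

If the compressed block size did not divide the HDFS block size (say 3 MiB against 64 MiB),
the second split would start in the middle of a compressed block and the mapper would have
to seek.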

I'm honestly not certain anymore that it's the best approach. I have my scheme set up using
a little JNI code I threw together that provides full zlib support, and the overall performance
gains over sequence files are fairly negligible.

It's still functionality that's missing from the Hadoop code that would be easy to add, though.
(Oracle is finally fixing this issue in the Java zlib implementation as part of Java 7.)

> Hadoop's support for zlib library lacks support to perform flushes (Z_SYNC_FLUSH and Z_FULL_FLUSH)
> --------------------------------------------------------------------------------------------------
>                 Key: HADOOP-6297
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6297
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Kevin J. Price
>            Assignee: Kevin J. Price
>            Priority: Minor
>         Attachments: zlibpatch-0.3.patch, zlibpatch.patch
> The zlib library supports two types of flushes when deflating data: Z_SYNC_FLUSH, which
> forces all pending input to be written as output, byte-aligned, and resets the Huffman coding,
> and Z_FULL_FLUSH, which does the same thing but additionally resets the compression dictionary.
> The Hadoop wrapper for the zlib library does not support either of these two methods.
> Adding support should be fairly trivial: an additional deflate method that takes a fourth
> "flush" parameter, and a modification to the native C code to accept this fourth parameter
> and pass it along to the zlib library.  I can submit a patch for this if desired.
> It should be noted that the native Sun Java API is likewise missing this functionality,
> as has been noted for over a decade here: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4206909
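
The difference between the two flush modes described in the issue can be sketched with
Python's standard zlib binding, used here only because it exposes the flush parameter
directly; it is not the Hadoop wrapper or the attached patch. The helper names and sample
data are hypothetical, and raw deflate (negative window bits) is an assumption made so that
each segment can be handed to a fresh decompressor without a header:

```python
import zlib

RAW_DEFLATE = -15  # negative window bits select raw deflate (no zlib header or trailer)


def compress_segments(chunks, flush_mode):
    """Compress each chunk and force a flush after it, returning one segment per chunk."""
    comp = zlib.compressobj(6, zlib.DEFLATED, RAW_DEFLATE)
    segments = [comp.compress(chunk) + comp.flush(flush_mode) for chunk in chunks]
    tail = comp.flush()  # default Z_FINISH terminates the stream
    return segments, tail


def decodes_alone(segment, expected):
    """True if a fresh decompressor can decode this segment without earlier data."""
    try:
        return zlib.decompressobj(RAW_DEFLATE).decompress(segment) == expected
    except zlib.error:
        return False


# Two identical, highly repetitive chunks, so the second can reference the first.
chunks = [b"hello world " * 100, b"hello world " * 100]

sync_segments, sync_tail = compress_segments(chunks, zlib.Z_SYNC_FLUSH)
full_segments, full_tail = compress_segments(chunks, zlib.Z_FULL_FLUSH)

# Both modes byte-align the output, so the whole stream still round-trips.
whole = zlib.decompressobj(RAW_DEFLATE).decompress(b"".join(sync_segments) + sync_tail)

# Only the full flush discards the history, so only its later segments stand alone.
full_is_independent = decodes_alone(full_segments[1], chunks[1])
sync_is_independent = decodes_alone(sync_segments[1], chunks[1])
```

With this input, the segment written after a full flush decodes on its own, while the
segment written after a sync flush still refers back to earlier data and does not.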

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

