Date: Thu, 24 Feb 2011 06:15:38 +0000 (UTC)
From: "Kevin J. Price (JIRA)"
To: common-issues@hadoop.apache.org
Reply-To: common-issues@hadoop.apache.org
Message-ID: <1028756401.12755.1298528138463.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Commented: (HADOOP-6297) Hadoop's support for zlib library lacks support to perform flushes (Z_SYNC_FLUSH and Z_FULL_FLUSH)

    [ https://issues.apache.org/jira/browse/HADOOP-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998702#comment-12998702 ]

Kevin J. Price commented on HADOOP-6297:
----------------------------------------

SequenceFile just compresses blocks of input into variable-size output blocks, which is different from having fixed-size output blocks. The theory is that if the compressed block size is fixed, and an even divisor of the HDFS block size, then a naive 'split at the HDFS block boundaries' will work without having to do any seeking around at the start of each mapper. Theoretically you get less start-of-mapper overhead and less reading from blocks that might not be rack-local.

I'm honestly not certain anymore that it's the best approach. I have my scheme set up using a little JNI code I threw together that provides full zlib support, and the overall performance gains over sequence files are fairly negligible. It's still functionality that's missing from the Hadoop code that would be easy to add, though. (Oracle is finally fixing this issue in the Java zlib implementation as part of Java 7.)

> Hadoop's support for zlib library lacks support to perform flushes (Z_SYNC_FLUSH and Z_FULL_FLUSH)
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6297
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6297
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Kevin J. Price
>            Assignee: Kevin J. Price
>            Priority: Minor
>         Attachments: zlibpatch-0.3.patch, zlibpatch.patch
>
>
> The zlib library supports two types of flushes when deflating data. It can perform a Z_SYNC_FLUSH, which forces all input to be written as output and byte-aligned and resets the Huffman coding, and a Z_FULL_FLUSH, which does the same thing but additionally resets the compression dictionary. The Hadoop wrapper for the zlib library does not support either of these two methods.
> Adding support should be fairly trivial. It needs an additional deflate method that takes a fourth "flush" parameter, and a modification to the native C code to accept this fourth parameter and pass it along to the zlib library. I can submit a patch for this if desired.
> It should be noted that the native Sun Java API is likewise missing this functionality, as has been noted for over a decade here: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4206909
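
For readers who want to see the difference between the two flush modes described in the issue, here is a minimal sketch using Python's standard zlib module, which wraps the same C library. It is not the Hadoop patch or the JNI code mentioned in the comment. The helper names (compress_in_segments, inflate_alone) and the sample data are made up for illustration, and raw deflate (negative wbits) is assumed so that each segment carries no zlib header or trailer.

```python
import zlib

# Empty stored block that zlib appends on a sync or full flush (byte-aligned).
SYNC_MARKER = b"\x00\x00\xff\xff"


def compress_in_segments(chunks, flush_mode):
    """Deflate each chunk and flush after it; return one compressed segment per chunk."""
    # Negative wbits selects raw deflate (no zlib header or trailer).
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    return [comp.compress(chunk) + comp.flush(flush_mode) for chunk in chunks]


def inflate_alone(segment):
    """Inflate one segment with a fresh decompressor; None if it depends on earlier data."""
    try:
        return zlib.decompressobj(-15).decompress(segment)
    except zlib.error:
        return None


# Hypothetical sample input: two identical, highly repetitive chunks.
chunks = [b"hadoop zlib flush " * 50, b"hadoop zlib flush " * 50]

full_segments = compress_in_segments(chunks, zlib.Z_FULL_FLUSH)
sync_segments = compress_in_segments(chunks, zlib.Z_SYNC_FLUSH)

# Full flush resets the dictionary, so the second segment can be inflated by itself.
full_second = inflate_alone(full_segments[1])

# Sync flush keeps the dictionary, so the second segment may refer back to the first;
# inflating it alone is expected to fail, while the concatenation still inflates.
sync_second = inflate_alone(sync_segments[1])
sync_joined = zlib.decompressobj(-15).decompress(b"".join(sync_segments))

print("full flush, second segment alone:", full_second == chunks[1])
print("sync flush, second segment alone:", sync_second == chunks[1])
print("sync flush, all segments joined: ", sync_joined == b"".join(chunks))
```

The independently decodable segments produced by Z_FULL_FLUSH are what a split-at-block-boundary scheme like the one discussed in the comment would rely on. Padding segments to a fixed size is a separate step that this sketch does not attempt.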