Message-ID: <1874826967.1244111347370.JavaMail.jira@brutus>
In-Reply-To: <1130594820.1219462845913.JavaMail.jira@brutus>
Date: Thu, 4 Jun 2009 03:29:07 -0700 (PDT)
From: "Abdul Qadeer (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4012) Providing splitting support for bzip2 compressed files

[
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716223#action_12716223 ]

Abdul Qadeer commented on HADOOP-4012:
--------------------------------------

(1) Running the test-patch target on my local box does not produce any new release audit warnings:

     [exec]
     [exec] There appear to be 503 release audit warnings before the patch and 501 release audit warnings after applying the patch.
     [exec]
     [exec]
     [exec]
     [exec]
     [exec] +1 overall.
     [exec]
     [exec] +1 @author. The patch does not contain any @author tags.
     [exec]
     [exec] +1 tests included. The patch appears to include 3 new or modified tests.
     [exec]
     [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
     [exec]
     [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]
     [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
     [exec]

(2) The test org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement.testBlockReplacement does not fail on my local box, and org.apache.hadoop.mapred.TestQueueCapacities.testMultipleQueues produces an error, but it does so even on the committed SVN code. These one or two errors are occurring for other patches on Hudson as well, and they are unrelated to bzip2.

Could the Hadoop committers please review the patch so that I can complete this work?

Thank you.

> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch, Hadoop-4012-version3.patch, Hadoop-4012-version4.patch, Hadoop-4012-version5.patch, Hadoop-4012-version6.patch, Hadoop-4012-version7.patch, Hadoop-4012-version8.patch, Hadoop-4012-version9.patch
>
>
> Hadoop assumes that compressed input data cannot be split, mainly because many codecs need the whole input stream to decompress successfully. In such a case, Hadoop prepares only one split per compressed file, where the lower split limit is 0 and the upper limit is the end of the file. The consequence of this decision is that one compressed file goes to a single mapper. Although this circumvents the codec limitation mentioned above, it substantially reduces the parallelism that splitting would otherwise make possible.
> BZip2 is a compression / decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This is an opportunity: instead of one BZip2 compressed file going to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2 compressed file, each compressed block is processed by exactly one mapper and ultimately all the blocks of the file are processed. (By processing we mean the actual use, in a mapper, of the uncompressed data coming out of the codec.)
> We are writing the code to implement this suggested functionality. Although we have used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 can easily use the splitting support. The details of these changes will be posted when we submit the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
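
The correctness criterion in the issue description (each compressed block is processed by exactly one mapper, and every block is processed) can be illustrated with a small sketch. This is not the code from the HADOOP-4012 patch: the class name, the method names, and the block offsets are all hypothetical, and the sketch assumes the start positions of the compressed blocks are already known. One simple rule that satisfies the criterion is that a split owns a block when the block's first position falls inside the split's range, even if the block's data runs past the end of the split.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these names are hypothetical and are not part of
// Hadoop or of the HADOOP-4012 patch.
final class BlockSplitSketch {

    /**
     * Returns the block start offsets owned by the split [splitStart, splitEnd).
     * A block belongs to the split that contains its first position, even if
     * the block's data continues past splitEnd.
     */
    static List<Long> blocksOwnedBy(long[] blockStarts, long splitStart, long splitEnd) {
        List<Long> owned = new ArrayList<>();
        for (long blockStart : blockStarts) {
            if (blockStart >= splitStart && blockStart < splitEnd) {
                owned.add(blockStart);
            }
        }
        return owned;
    }

    /** Cuts [0, fileLength) into consecutive splits and lists the blocks each split owns. */
    static List<List<Long>> assign(long[] blockStarts, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<List<Long>> perSplit = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long end = Math.min(start + splitSize, fileLength);
            perSplit.add(blocksOwnedBy(blockStarts, start, end));
        }
        return perSplit;
    }

    public static void main(String[] args) {
        long[] blockStarts = {4, 900, 1850, 2700, 3600}; // made-up offsets for illustration
        List<List<Long>> perSplit = assign(blockStarts, 4000, 1000);
        for (int i = 0; i < perSplit.size(); i++) {
            System.out.println("split " + i + " -> blocks starting at " + perSplit.get(i));
        }
    }
}
```

Because the split ranges are consecutive and half-open, every block start falls into exactly one range, so no block is read twice and none is skipped. A real implementation would additionally have to locate the block boundaries inside the compressed stream, which this sketch takes as given.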