From: "Yuri Pradkin (JIRA)"
To: common-issues@hadoop.apache.org
Reply-To: common-issues@hadoop.apache.org
Date: Fri, 10 Jul 2009 13:57:14 -0700 (PDT)
Message-ID: <912077286.1247259434960.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-4012) Providing splitting support for bzip2 compressed files

    [ https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729814#action_12729814 ]

Yuri Pradkin commented on
HADOOP-4012:
--------------------------------------

bq. -1 to changing LineRecordReader. In particular, you've undone changes that were made by other jiras. This is very very touchy code that the current version of this patch breaks.

Can you please be more specific: which changes were undone? Can you also elaborate on what exactly this patch breaks (perhaps we need more tests)? The code is indeed very heavily used, but I think it is also very well exercised by various tests.

bq. I really think that this is better done by using a separate BzipTextInputFormat and BzipLineRecordReader.

I think the point of having splitting built in is that all readers/formats can avoid re-implementing common things. We are currently using a binary format reader that works just fine with this patch with only minor changes. Moreover, the idea is to work out a common framework in which other block-compressed formats could be processed in a similar manner. The alternative that you're suggesting is to have BzipTextInputFormat, XXXBlockCompressInputFormat, YYYBlockCompressInputFormat, and so on.

> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch, Hadoop-4012-version3.patch, Hadoop-4012-version4.patch, Hadoop-4012-version5.patch, Hadoop-4012-version6.patch, Hadoop-4012-version7.patch, Hadoop-4012-version8.patch, Hadoop-4012-version9.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split (mainly because many codecs need the whole input stream to decompress successfully).
> So in such a case, Hadoop prepares only one split per compressed file, with the lower split limit at 0 and the upper limit at the end of the file. The consequence of this decision is that one compressed file goes to a single mapper. Although this circumvents the codec limitation mentioned above, it substantially reduces the parallelism that splitting would otherwise provide.
> BZip2 is a compression/decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This is an opportunity: instead of one BZip2 compressed file going to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2 compressed file, each compressed block is processed by only one mapper and ultimately all the blocks of the file are processed. (By processing we mean the actual use of the uncompressed data, coming out of the codec, in a mapper.)
> We are writing the code to implement this suggested functionality. Although we have used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 could easily use the splitting support. The details of these changes will be posted when we submit the code.
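
The correctness criterion in the issue description (each compressed block is processed by only one mapper, and all blocks are processed) can be illustrated with a small sketch. This is not the code from the attached patches. It assumes one possible ownership rule: a block belongs to the split whose range contains the block's starting offset. The function names and the offsets are hypothetical, and real bzip2 block boundaries are bit-aligned rather than byte-aligned, which this sketch ignores.

```python
from bisect import bisect_right


def assign_blocks_to_splits(block_starts, split_starts):
    """Map each compressed block to the index of the split that owns it.

    Assumed rule (not taken from the patch): a block is owned by the
    split whose range contains the block's starting offset.
    """
    if not split_starts or split_starts[0] != 0:
        raise ValueError("split_starts must be sorted and begin at 0")
    owners = {}
    for start in block_starts:
        # bisect_right counts the split starts that are <= start, so the
        # owner is the last split beginning at or before this block.
        owners[start] = bisect_right(split_starts, start) - 1
    return owners


def blocks_for_split(block_starts, split_starts, split_index):
    """Return the blocks that the mapper for split_index should process."""
    owners = assign_blocks_to_splits(block_starts, split_starts)
    return [b for b in block_starts if owners[b] == split_index]


if __name__ == "__main__":
    # Hypothetical offsets: five compressed blocks, four splits.
    blocks = [0, 900, 1800, 2700, 3600]
    splits = [0, 1000, 2000, 3000]
    for i in range(len(splits)):
        print(i, blocks_for_split(blocks, splits, i))
```

Under this rule a block that straddles a split boundary is read in full by the mapper that owns its start, and the next mapper skips ahead to the first block that begins inside its own range. That is what keeps every block processed once and only once.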