hadoop-common-dev mailing list archives

From "Abdul Qadeer (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-4012) Providing splitting support for bzip2 compressed files
Date Sat, 23 Aug 2008 03:40:45 GMT
Providing splitting support for bzip2 compressed files
------------------------------------------------------

                 Key: HADOOP-4012
                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
             Project: Hadoop Core
          Issue Type: New Feature
          Components: io
    Affects Versions: 0.19.0
            Reporter: Abdul Qadeer
            Assignee: Abdul Qadeer


Hadoop assumes that compressed input data cannot be split, mainly because many codecs need
the whole input stream to decompress successfully.  In that case Hadoop prepares only one
split per compressed file, with the lower split limit at 0 and the upper limit at the end
of the file.  As a consequence, each compressed file goes to a single mapper.  This
circumvents the codec limitation mentioned above, but it substantially reduces the
parallelism that splitting would otherwise provide.
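
The behaviour described above can be illustrated with a minimal sketch.  This is not Hadoop's
code: the function name compute_splits and its parameters are hypothetical, and the splitting
rule for uncompressed files is simplified to fixed-size byte ranges.

```python
def compute_splits(file_length, split_size, compressed):
    """Return a list of (start, end) byte ranges, with end exclusive.

    Hypothetical helper for illustration only; not part of Hadoop.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    if compressed:
        # One split for the whole file: lower limit 0, upper limit end of file.
        return [(0, file_length)]
    splits = []
    start = 0
    while start < file_length:
        end = min(start + split_size, file_length)
        splits.append((start, end))
        start = end
    return splits


if __name__ == "__main__":
    print(compute_splits(250, 100, compressed=True))
    print(compute_splits(250, 100, compressed=False))
```

With a 250-byte file and a split size of 100, the compressed case yields a single range for
one mapper, while the uncompressed case yields three ranges that can be processed in parallel.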

BZip2 is a compression / decompression algorithm that compresses data in blocks, and these
compressed blocks can later be decompressed independently of each other.  This creates an
opportunity: instead of one BZip2 compressed file going to one mapper, we can process
chunks of the file in parallel.  The correctness criterion for such processing is that,
for a bzip2 compressed file, each compressed block should be processed by only one mapper
and, ultimately, all the blocks of the file should be processed.  (By processing we mean
the actual use, in a mapper, of the uncompressed data coming out of the codec.)
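
One way to meet this correctness criterion, shown here only as a sketch and not as the
approach taken in the patch, is to let each compressed block belong to the split that
contains the block's starting offset.  The function assign_blocks is hypothetical, and the
block start offsets are assumed to be already known; locating them in a real bzip2 stream
is not shown.

```python
from bisect import bisect_right


def assign_blocks(block_starts, split_starts):
    """Map each split index to the blocks whose start offset falls inside it.

    split_starts must be sorted and begin at 0; split i covers the range from
    split_starts[i] up to the next split start (or the end of the file).
    Hypothetical helper for illustration only.
    """
    if not split_starts or split_starts[0] != 0:
        raise ValueError("split_starts must be non-empty and begin at 0")
    assignment = {i: [] for i in range(len(split_starts))}
    for block in block_starts:
        # Index of the last split that starts at or before this block.
        owner = bisect_right(split_starts, block) - 1
        assignment[owner].append(block)
    return assignment


if __name__ == "__main__":
    blocks = [0, 90, 180, 270, 360]
    splits = [0, 100, 200, 300]
    print(assign_blocks(blocks, splits))
```

Because every block start falls into exactly one split range, each block has a single owner
and no block is skipped.  A mapper following this rule would continue reading past the end
of its split until it finishes the last block that started inside it.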

We are writing the code to implement this functionality.  Although we have used bzip2 as
an example, we have tried to extend Hadoop's compression interfaces so that any other
codec with the same capability as bzip2 can easily use the splitting support.  The details
of these changes will be posted when we submit the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

