hadoop-common-issues mailing list archives

From "Abdul Qadeer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4012) Providing splitting support for bzip2 compressed files
Date Wed, 02 Sep 2009 18:00:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750566#action_12750566 ]

Abdul Qadeer commented on HADOOP-4012:
--------------------------------------

{quote}
{noformat}
+    if(in.getPos() <= start){
+      ((Seekable)seekableIn).seek(start);
+        in = this.createInputStream(seekableIn, readMode);
+        
+
+    }
{noformat}

This drops the stream it created above, which wrapped the stream passed in with a CBZip2InputStream
and BufferedInputStream. It's not clear why the stream is being re-created, either... particularly
since the start stored in the codec is left alone. What case is being handled here?
{quote}

The stream is re-created when in.getPos() <= start in order to handle cases like
the following:

Assume [BBBBBB] represents a BZip2 marker and d is a single compressed data element (this can
happen, e.g., due to BZip2 concatenation).

There is some extra information at the start of the stream, i.e. BZh0.

^ indicates the current position in the stream:
{noformat}

[BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
_______________________________ ^

I go back 10 bytes in the stream before searching for a marker.  The reason
is that the first 'marker' is 10 bytes long; all others are 6 bytes long.

So after going backwards the stream position is as follows:

[BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
__________________ ^

Now finding the next marker might align us with the wrong marker, as follows:

[BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
______________________ ^
{noformat}

So the code mentioned above handles such cases.  But you rightly pointed out that I should
also have set this.start = start at the end of that code.
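
To make the case above concrete, here is a minimal toy sketch in plain Java. It is not the patch's code: the class and method names are hypothetical, the stream is modelled as a String using the same notation as the diagrams, and it assumes that the position reported after aligning is the offset just past the marker that was found.

```java
public class MarkerAlignmentSketch {
    // Toy 6-byte block marker, written as in the diagrams above.
    static final String MARKER = "BBBBBB";
    // The first 'marker' (header plus marker) is 10 bytes long, so back up by 10.
    static final int BACKUP = 10;

    /** Naive alignment: back up, then take the first marker found. */
    static int alignNaive(String stream, int start) {
        int from = Math.max(0, start - BACKUP);
        return stream.indexOf(MARKER, from);
    }

    /**
     * Guarded alignment: if the marker found after backing up ends at or
     * before start (assumed analogue of in.getPos() <= start), it is the
     * wrong marker, so seek to start and search again from there.
     */
    static int align(String stream, int start) {
        int marker = alignNaive(stream, start);
        if (marker >= 0 && marker + MARKER.length() <= start) {
            marker = stream.indexOf(MARKER, start);
        }
        return marker;
    }

    public static void main(String[] args) {
        // Very short blocks: backing up 10 bytes skips over a whole earlier marker.
        String tiny = "BZh0BBBBBBdBBBBBBdBBBBBBdBBBBBB";
        System.out.println("naive:   " + alignNaive(tiny, 25));
        System.out.println("guarded: " + align(tiny, 25));

        // Longer blocks: start falls inside a marker, and backing up finds it.
        String longer = "BZh0BBBBBB" + "dddddddddddd" + "BBBBBB" + "dddd";
        System.out.println("straddling marker: " + align(longer, 24));
    }
}
```

In the first stream the naive search settles on the marker at offset 18, which lies entirely before start = 25, while the guarded version ends up on the marker at offset 25. The second stream shows why backing up is still useful when a marker straddles the split start.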


{quote}
I tried a version of this using a supertype of CompressionInputStream instead of the semantics
tried so far (voiding the synchronization discussion). It doesn't incorporate the other changes
discussed.
{quote}

The new version looks fine to me.  Let me incorporate the other changes you mentioned into it
and put the new patch on the JIRA.



> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: C4012-12.patch, Hadoop-4012-version1.patch, Hadoop-4012-version10.patch,
Hadoop-4012-version11.patch, Hadoop-4012-version2.patch, Hadoop-4012-version3.patch, Hadoop-4012-version4.patch,
Hadoop-4012-version5.patch, Hadoop-4012-version6.patch, Hadoop-4012-version7.patch, Hadoop-4012-version8.patch,
Hadoop-4012-version9.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split (mainly because
many codecs need the whole input stream to decompress successfully).
 So in such a case, Hadoop prepares only one split per compressed file, where the lower split
limit is 0 and the upper limit is the end of the file.  The consequence of this decision
is that one compressed file goes to a single mapper. Although this circumvents the limitation
of the codecs mentioned above, it substantially reduces the parallelism that splitting would
otherwise make possible.
> BZip2 is a compression/decompression algorithm that compresses data in blocks, and
these compressed blocks can later be decompressed independently of each other.  This
is an opportunity: instead of one BZip2 compressed file going to one mapper, we
can process chunks of the file in parallel.  The correctness criterion for such processing is
that, for a bzip2 compressed file, each compressed block should be processed by only one mapper
and ultimately all the blocks of the file should be processed.  (By processing we mean the
actual use of the uncompressed data (coming out of the codec) in a mapper.)
> We are writing the code to implement this suggested functionality.  Although we have
used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that
any other codec with the same capability as bzip2 can easily use the splitting
support.  The details of these changes will be posted when we submit the code.
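
The correctness criterion in the description (each compressed block processed by exactly one mapper, and every block processed) can be checked on a toy example. The sketch below is only an illustration under an assumed ownership rule, namely that a block belongs to the split whose byte range contains the first byte of its marker. The class name, method name, and offsets are hypothetical and are not taken from the patch.

```java
public class BlockOwnershipSketch {
    /**
     * For each block marker offset, counts how many fixed-size splits claim it
     * under the assumed rule: split [s, e) owns a block iff s <= offset < e.
     */
    static int[] claimCounts(long[] markerOffsets, long fileLength, long splitSize) {
        int[] counts = new int[markerOffsets.length];
        for (long s = 0; s < fileLength; s += splitSize) {
            long e = Math.min(s + splitSize, fileLength);
            for (int i = 0; i < markerOffsets.length; i++) {
                if (markerOffsets[i] >= s && markerOffsets[i] < e) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical marker offsets in a 31-byte file, cut into 10-byte splits.
        long[] markers = {4, 11, 18, 25};
        int[] counts = claimCounts(markers, 31, 10);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("block at " + markers[i] + " claimed by " + counts[i] + " split(s)");
        }
    }
}
```

A count of 1 for every block means no block is skipped and none is processed twice, which is the property the splitting support has to preserve.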

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

