hadoop-common-issues mailing list archives

From "Len Trigg (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
Date Thu, 27 Oct 2011 20:22:34 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137479#comment-13137479 ]

Len Trigg commented on HADOOP-6852:
-----------------------------------

We have been using the Ant-based bzip2 library for our project and needed to decompress
concatenated bzip2 files. After poking around we came across the Hadoop extensions and immediately
found that they do not work correctly because of this bug. Essentially, when crossing block
boundaries, the skipToNextMarker method leaves the stream position at the end of the block
delimiter, but initBlock expects it to be at the beginning of the block delimiter. After looking
at the poor structure of the initBlock method, and the thread-unsafety that has been introduced
into this class with the numberOfBytesTillNextMarker() method, we decided to avoid the Hadoop
version of this class altogether.
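
To make the position mismatch concrete, here is a minimal toy sketch in plain Java. It is not the Hadoop CBZip2InputStream code: the class and method names (MarkerPositionSketch, skipPastMarker, sample, and so on) are hypothetical, and it assumes byte-aligned markers, whereas real bzip2 blocks are bit-aligned. It only shows why a scanner that consumes the delimiter cannot hand off to an initialiser that expects to read the delimiter itself, and one possible way to reconcile the two.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

/** Toy model of the position mismatch; NOT the Hadoop implementation. */
class MarkerPositionSketch {
    // bzip2 block-header magic. Assumption: byte-aligned here for simplicity.
    static final byte[] BLOCK_MAGIC = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};

    /** Hypothetical scanner: consumes bytes up to and including the next marker. */
    static boolean skipPastMarker(InputStream in) throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == (BLOCK_MAGIC[matched] & 0xff)) {
                if (++matched == BLOCK_MAGIC.length) {
                    return true; // position is now at the END of the marker
                }
            } else {
                // 0x31 occurs only at index 0 of the magic, so this restart is safe.
                matched = (b == (BLOCK_MAGIC[0] & 0xff)) ? 1 : 0;
            }
        }
        return false;
    }

    /** Hypothetical initialiser: expects the marker to START at the current position. */
    static void initBlock(InputStream in) throws IOException {
        for (byte expected : BLOCK_MAGIC) {
            if (in.read() != (expected & 0xff)) {
                throw new IOException("bad block header");
            }
        }
    }

    /** The scanner eats the marker, then initBlock looks for it again and fails. */
    static boolean mismatchedHandoffFails(byte[] data) throws IOException {
        InputStream in = new ByteArrayInputStream(data);
        skipPastMarker(in);
        try {
            initBlock(in);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    /** One possible reconciliation: push the marker back before handing off. */
    static boolean pushedBackHandoffSucceeds(byte[] data) throws IOException {
        PushbackInputStream in =
            new PushbackInputStream(new ByteArrayInputStream(data), BLOCK_MAGIC.length);
        if (!skipPastMarker(in)) {
            return false;
        }
        in.unread(BLOCK_MAGIC); // position is back at the START of the marker
        try {
            initBlock(in);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Two junk bytes, one marker, then some stand-in block content. */
    static byte[] sample() {
        byte[] payload = "payload".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[2 + BLOCK_MAGIC.length + payload.length];
        data[0] = 'x';
        data[1] = 'y';
        System.arraycopy(BLOCK_MAGIC, 0, data, 2, BLOCK_MAGIC.length);
        System.arraycopy(payload, 0, data, 2 + BLOCK_MAGIC.length, payload.length);
        return data;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = sample();
        System.out.println("handoff after consuming marker fails: "
            + mismatchedHandoffFails(data));
        System.out.println("handoff after pushing marker back works: "
            + pushedBackHandoffSucceeds(data));
    }
}
```

Pushing the marker back is only one option; the initialiser could equally be told that the marker has already been consumed. Which of these suits the Hadoop class is not something this sketch settles.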
                
> apparent bug in concatenated-bzip2 support (decoding)
> -----------------------------------------------------
>
>                 Key: HADOOP-6852
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6852
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.22.0
>         Environment: Linux x86_64 running 32-bit Hadoop, JDK 1.6.0_15
>            Reporter: Greg Roelofs
>
> The following simplified code (manually picked out of testMoreBzip2() in https://issues.apache.org/jira/secure/attachment/12448272/HADOOP-6835.v4.trunk-hadoop-mapreduce.patch)
> triggers a "java.io.IOException: bad block header" in
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:527):
> {noformat}
>     JobConf jobConf = new JobConf(defaultConf);
>     CompressionCodec bzip2 = new BZip2Codec();
>     ReflectionUtils.setConf(bzip2, jobConf);
>     localFs.delete(workDir, true);
>     // copy multiple-member test file to HDFS
>     String fn2 = "testCompressThenConcat.txt" + bzip2.getDefaultExtension();
>     Path fnLocal2 = new Path(System.getProperty("test.concat.data","/tmp"),fn2);
>     Path fnHDFS2  = new Path(workDir, fn2);
>     localFs.copyFromLocalFile(fnLocal2, fnHDFS2);
>     FileInputFormat.setInputPaths(jobConf, workDir);
>     final FileInputStream in2 = new FileInputStream(fnLocal2.toString());
>     CompressionInputStream cin2 = bzip2.createInputStream(in2);
>     LineReader in = new LineReader(cin2);
>     Text out = new Text();
>     int numBytes, totalBytes=0, lineNum=0;
>     while ((numBytes = in.readLine(out)) > 0) {
>       ++lineNum;
>       totalBytes += numBytes;
>     }
>     in.close();
> {noformat}
> The specified file is also included in the HADOOP-6835 patch linked above, and some additional
> debug output is included in the commented-out test loop above.  (Only in the linked "v4"
> version of the patch, however; I'm about to remove the debug stuff for checkin.)
> It's possible I've done something completely boneheaded here, but the file, at least,
> checks out in a subsequent set of subtests and with stock bzip2 itself.  Only the code above
> is problematic; it reads through the first concatenated chunk (17 lines of text) just fine
> but chokes on the header of the second one.  Altogether, the test file contains 84 lines of
> text and 4 concatenated bzip2 files.
> (It's possible this is a mapreduce issue rather than common, but note that the identical
> gzip test works fine.  Possibly it's related to the stream-vs-decompressor dichotomy, though;
> is that intentionally not supported?)
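
For comparison with the gzip case mentioned above, the following is a small sketch of what reading a concatenated, multi-member file is expected to look like. It uses the JDK's own java.util.zip classes rather than Hadoop's codecs (the JDK has no bzip2 support), so it only illustrates the expected behaviour and says nothing about the Hadoop gzip code path. The class and method names are hypothetical, and it assumes a JDK whose GZIPInputStream reads through concatenated members.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustration only: JDK gzip classes, not Hadoop's GzipCodec or BZip2Codec. */
class ConcatenatedGzipSketch {

    /** Compresses one piece of text into a complete, standalone gzip member. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    /** Decompresses the given bytes and counts the lines of text they contain. */
    static int countLines(byte[] compressed) throws IOException {
        int lines = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    /** Compresses each text separately, concatenates the members, then reads them back. */
    static int countLinesAcrossMembers(String... members) throws IOException {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (String member : members) {
            joined.write(gzip(member)); // equivalent to `cat a.gz b.gz > joined.gz`
        }
        return countLines(joined.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        int lines = countLinesAcrossMembers("one\ntwo\n", "three\n", "four\nfive\n");
        System.out.println("lines read across all members: " + lines);
    }
}
```

A decoder that stops after the first member, or fails on the second member's header as in the report above, would return fewer lines than the members contain in total.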

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
