hadoop-common-dev mailing list archives

From "Suhas Gogate (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-3898) avoid bzip2 decompressor throwing exception on corrupted (prematurely truncated) file
Date Mon, 04 Aug 2008 20:42:44 GMT
avoid bzip2 decompressor throwing exception on corrupted (prematurely truncated) file
-------------------------------------------------------------------------------------

                 Key: HADOOP-3898
                 URL: https://issues.apache.org/jira/browse/HADOOP-3898
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.17.1
            Reporter: Suhas Gogate


When running a map-reduce streaming job on bzip2-compressed input, the job fails with one
of the two Java exceptions shown in the stack traces that follow.

This seems to happen when one of the bz2 input files is corrupted (probably when the file
is prematurely truncated). An example command is given after the stack traces.

Can we fix the bzip2 decompressor so that it does not throw these two exceptions?


2008-07-16 07:23:39,605 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: mark/reset not supported
        at java.io.InputStream.reset(InputStream.java:334)
        at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.readLine(Bzip2TextInputFormat.java:117)
        at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:140)
        at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:34)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:158)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

or

2008-07-16 20:49:28,020 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: CRC error
        at org.apache.tools.bzip2r.CBZip2InputStream.cadvise(CBZip2InputStream.java:74)
        at org.apache.tools.bzip2r.CBZip2InputStream.crcError(CBZip2InputStream.java:378)
        at org.apache.tools.bzip2r.CBZip2InputStream.endBlock(CBZip2InputStream.java:351)
        at org.apache.tools.bzip2r.CBZip2InputStream.setupNoRandPartA(CBZip2InputStream.java:851)
        at org.apache.tools.bzip2r.CBZip2InputStream.setupNoRandPartB(CBZip2InputStream.java:903)
        at org.apache.tools.bzip2r.CBZip2InputStream.read(CBZip2InputStream.java:240)
        at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.readLine(Bzip2TextInputFormat.java:102)
        at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:140)
        at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:34)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:158)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
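
The first trace ends in java.io.InputStream.reset, whose default implementation always
throws "mark/reset not supported". The sketch below only illustrates that JDK behaviour and
one common way around it (wrapping the stream in a BufferedInputStream); it is not the code
of Bzip2TextInputFormat. MarkResetDemo, NoMarkStream and tryReset are hypothetical names
introduced for this illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {

    /** Hypothetical stand-in for a stream that inherits InputStream's mark/reset. */
    static final class NoMarkStream extends InputStream {
        private final InputStream in;

        NoMarkStream(byte[] data) {
            this.in = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }
        // mark() and reset() are NOT overridden: mark() is a no-op and
        // reset() throws IOException("mark/reset not supported").
    }

    /** Marks, reads one byte, resets; returns the IOException message, or null on success. */
    static String tryReset(InputStream in) throws IOException {
        in.mark(1);
        in.read();
        try {
            in.reset();
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {'a', '\r', 'b'};
        // Bare stream: reset() fails with the message seen in the first trace.
        System.out.println(tryReset(new NoMarkStream(data)));
        // Same stream behind a BufferedInputStream: mark/reset is supported.
        System.out.println(tryReset(new BufferedInputStream(new NoMarkStream(data))));
    }
}
```

Whether the record reader should buffer the stream or avoid reset() altogether is a design
question for the actual fix; the sketch only shows where the message comes from.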


Example:
$HADOOP_HOME/bin/hadoop jar -libjars $<path>/jars/bzip2.jar \
  $HADOOP_HOME/hadoop-streaming.jar \
  -inputformat org.apache.hadoop.mapred.Bzip2TextInputFormat \
  -mapper "cat" \
  -reducer "cat" \
  -numReduceTasks 20 \
  -input '<path>/corrupt-data.bz2'  \
  -output bzip2_bug_example \
  -jobconf stream.num.map.output.key.fields=1 \
  -jobconf stream.num.reduce.output.fields=1 \
  -jobconf num.key.fields.for.partition=1
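
One possible reading of the request is that the line reader should treat an IOException
from the decompression stream as an early end of input, log a warning, and keep the records
read so far. The following is a minimal sketch of that idea using only java.io; it does not
use Hadoop or the bzip2 classes, and TolerantLineReader, FailingStream and readLines are
hypothetical names. FailingStream merely simulates a decompressor that throws "CRC error"
partway through.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TolerantLineReader {

    /** Hypothetical test double: serves the given bytes, then throws like a bad block would. */
    static final class FailingStream extends InputStream {
        private final byte[] good;
        private int pos = 0;

        FailingStream(byte[] good) {
            this.good = good;
        }

        @Override
        public int read() throws IOException {
            if (pos < good.length) {
                return good[pos++] & 0xff;
            }
            throw new IOException("CRC error");
        }
    }

    /**
     * Reads '\n'-terminated lines until EOF or the first IOException.
     * A line interrupted by the exception is dropped because it may be incomplete.
     */
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<String>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        try {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    lines.add(new String(current.toByteArray(), StandardCharsets.UTF_8));
                    current.reset();
                } else {
                    current.write(b);
                }
            }
            // Clean EOF: keep a final line that has no trailing newline.
            if (current.size() > 0) {
                lines.add(new String(current.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            // Assumption for this sketch: warn and stop instead of failing the task.
            System.err.println("WARN: input ended early, dropping partial record: " + e.getMessage());
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "one\ntwo\nthr".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLines(new FailingStream(data))); // prints [one, two]
    }
}
```

The trade-off is that corrupted input is then skipped with only a warning, so a job can
succeed on partial data; whether that is acceptable is for the issue discussion to decide.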


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

