hadoop-common-issues mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-15543) IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
Date Fri, 15 Jun 2018 09:39:00 GMT
Sebastian Nagel created HADOOP-15543:
----------------------------------------

             Summary: IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
                 Key: HADOOP-15543
                 URL: https://issues.apache.org/jira/browse/HADOOP-15543
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 3.1.0
            Reporter: Sebastian Nagel


When reading a bzip2-compressed SequenceFile, Hadoop jobs fail with: 
{noformat}
IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046)
{noformat}
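
For orientation, the message has the shape of an ordinary array bounds check on a read request: 477658 + 477659 = 955317, which exceeds the destination length of 678046. The block below is a minimal sketch of such a check in plain Java. It is not the Hadoop source, and the class and method names ({{BoundsCheckSketch}}, {{checkBounds}}) are hypothetical. It only shows why these three numbers trigger the exception.

```java
// Hypothetical illustration of a read-request bounds check; not the Hadoop source.
class BoundsCheckSketch {

    // Rejects a request to write len bytes into dest starting at offs
    // when the request would run past the end of the array.
    static void checkBounds(byte[] dest, int offs, int len) {
        if (offs < 0 || len < 0) {
            throw new IndexOutOfBoundsException(
                "offs(" + offs + ") or len(" + len + ") is negative.");
        }
        // Widen to long so the sum cannot overflow int.
        if ((long) offs + len > dest.length) {
            throw new IndexOutOfBoundsException(
                "offs(" + offs + ") + len(" + len + ") > dest.length(" + dest.length + ").");
        }
    }

    public static void main(String[] args) {
        byte[] dest = new byte[678046];   // buffer size taken from the report
        try {
            checkBounds(dest, 477658, 477659);
        } catch (IndexOutOfBoundsException e) {
            // prints: offs(477658) + len(477659) > dest.length(678046).
            System.out.println(e.getMessage());
        }
    }
}
```

With the reported values, only 678046 - 477658 = 200388 bytes remain in the buffer after the offset, so a request for 477659 bytes cannot be satisfied.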

The SequenceFile (669 MB) has been written with the properties
- mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
- mapreduce.output.fileoutputformat.compress.type=BLOCK

using the native bzip2 library on Hadoop CDH 5.14.2 (Ubuntu 16.04, libbz2-1.0 1.0.6-8).

The error was seen on two development systems (local mode, no native bzip2 lib configured/installed)
and, so far, is reproducible with Hadoop 3.1.0 and CDH 5.14.2.

The following Hadoop releases are not affected: 2.7.4, 3.0.2, CDH 5.14.0. The SequenceFile
is read successfully when these Hadoop packages are used.

If required, I can share the SequenceFile. It's a Nutch CrawlDb (it contains [CrawlDatum|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java]
objects).

Full stack trace as seen with 3.1.0:
{noformat}
2018-06-15 10:34:43,198 INFO  mapreduce.Job -  map 93% reduce 0%
2018-06-15 10:34:43,532 WARN  mapred.LocalJobRunner - job_local543410164_0001
java.lang.Exception: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
Caused by: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
        at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
        at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:496)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.hadoop.io.WritableUtils.readString(WritableUtils.java:125)
        at org.apache.hadoop.io.WritableUtils.readStringArray(WritableUtils.java:169)
        at org.apache.nutch.protocol.ProtocolStatus.readFields(ProtocolStatus.java:177)
        at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:188)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:332)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2374)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2358)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:568)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

