hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15543) IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
Date Fri, 15 Jun 2018 10:14:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513637#comment-16513637
] 

Steve Loughran commented on HADOOP-15543:
-----------------------------------------

It'd probably be good to stick that up somewhere (home.apache.org?) so we can have a look at it.

What happens if you try to use the OS unzip tools?

I think we need to work out whether this is something wrong with the reader code, or whether the writer
has generated something bad, then find out who knows the native code well enough to fix it.

FWIW, I don't see any changes in the native bzip code since 2015 (HADOOP-10027).

Looking at the Java code, the only 3.1 change in this area is HADOOP-6852, in BZip2Codec.
Going to tag that as the cause unless we can see otherwise.
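In the spirit of the "OS unzip tools" suggestion above, raw bzip2 stream handling can be sanity-checked outside Hadoop with nothing but the Python standard library. A BLOCK-compressed SequenceFile is not a plain .bz2 file (the codec's streams are wrapped in SequenceFile record framing), so this only exercises decompression of concatenated standard bzip2 streams, not the file layout; the function name below is illustrative:

```python
import bz2

def decompress_concatenated(data: bytes) -> bytes:
    """Decompress a buffer holding one or more back-to-back bzip2 streams."""
    chunks = []
    while data:
        dec = bz2.BZ2Decompressor()      # one decompressor handles one stream
        chunks.append(dec.decompress(data))
        data = dec.unused_data           # bytes that follow the stream's end
    return b"".join(chunks)

payload = bz2.compress(b"hello ") + bz2.compress(b"world")
print(decompress_concatenated(payload))  # b'hello world'
```

If plain concatenated streams decode cleanly but Hadoop's reader fails on the same payload, that would point at the reader rather than the compressed data.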


> IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
> --------------------------------------------------------------------
>
>                 Key: HADOOP-15543
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15543
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Sebastian Nagel
>            Priority: Major
>
> When reading a bzip2-compressed SequenceFile, Hadoop jobs fail with: 
> {noformat}
> IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046)
> {noformat}
> The SequenceFile (669 MB) has been written with the properties
> - mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
> - mapreduce.output.fileoutputformat.compress.type=BLOCK
> using the native bzip2 library on Hadoop CDH 5.14.2 (Ubuntu 16.04, libbz2-1.0 1.0.6-8).
> The error was seen on two development systems (local mode, no native bzip2 lib configured/installed)
> and, so far, is reproducible with Hadoop 3.1.0 and CDH 5.14.2.
> The following Hadoop releases are not affected: 2.7.4, 3.0.2, CDH 5.14.0. The SequenceFile
> is read successfully when these Hadoop packages are used.
> If required I can share the SequenceFile. It's a Nutch CrawlDb (contains [CrawlDatum|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java]
> objects).
> Full-stack as seen with 3.1.0:
> {noformat}
> 2018-06-15 10:34:43,198 INFO  mapreduce.Job -  map 93% reduce 0%
> 2018-06-15 10:34:43,532 WARN  mapred.LocalJobRunner - job_local543410164_0001
> java.lang.Exception: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
> Caused by: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
>         at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
>         at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:496)
>         at java.io.DataInputStream.readFully(DataInputStream.java:195)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.io.WritableUtils.readString(WritableUtils.java:125)
>         at org.apache.hadoop.io.WritableUtils.readStringArray(WritableUtils.java:169)
>         at org.apache.nutch.protocol.ProtocolStatus.readFields(ProtocolStatus.java:177)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:188)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:332)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
>         at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2374)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2358)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:568)
>         at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>         at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
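The numbers in the message describe a guard of the form offs + len > dest.length at the top of CBZip2InputStream.read. A minimal sketch of that arithmetic (illustrative names only, not the actual Hadoop code) shows why these particular values trip it; note that len is exactly offs + 1, which may hint the caller computed the length argument incorrectly rather than the buffer simply being too small:

```python
def checked_copy(dest: bytearray, offs: int, length: int) -> None:
    # Illustrative guard of the same shape as the bounds check that raised
    # the reported exception: reject a read whose window overruns dest.
    if offs + length > len(dest):
        raise IndexError(
            "offs(%d) + len(%d) > dest.length(%d)" % (offs, length, len(dest)))

buf = bytearray(678046)              # dest.length from the report
try:
    checked_copy(buf, 477658, 477659)
except IndexError as exc:
    print(exc)  # offs(477658) + len(477659) > dest.length(678046)
```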



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

