From: "Sebastian Nagel (JIRA)"
To: common-issues@hadoop.apache.org
Date: Fri, 15 Jun 2018 09:39:00 +0000 (UTC)
Subject: [jira] [Created] (HADOOP-15543) IndexOutOfBoundsException when reading bzip2-compressed SequenceFile

Sebastian Nagel created HADOOP-15543:
----------------------------------------

             Summary: IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
                 Key: HADOOP-15543
                 URL: https://issues.apache.org/jira/browse/HADOOP-15543
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 3.1.0
            Reporter: Sebastian Nagel


When reading a bzip2-compressed SequenceFile, Hadoop jobs fail with:
{noformat}
IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046)
{noformat}

The SequenceFile (669 MB) was written with the properties
- mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
- mapreduce.output.fileoutputformat.compress.type=BLOCK

using the native bzip2 library on Hadoop CDH 5.14.2 (Ubuntu 16.04, libbz2-1.0 1.0.6-8).

The error was seen on two development systems (local mode, no native bzip2 lib configured/installed) and, so far, is reproducible with Hadoop 3.1.0 and CDH 5.14.2. The following Hadoop releases are not affected: 2.7.4, 3.0.2, CDH 5.14.0. The SequenceFile is read successfully when these Hadoop packages are used.

If required I can share the SequenceFile. It's a Nutch CrawlDb (contains [CrawlDatum|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java] objects).
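For readers unfamiliar with the message format, the following is a minimal sketch of the kind of range check that would produce it. It is not the Hadoop source: the class name BoundsCheckSketch and the method checkRange are hypothetical, and only the message layout is taken from the report above. With the reported values, offs + len = 477658 + 477659 = 955317, which exceeds dest.length = 678046 by 277271 bytes, so the requested range cannot fit in the destination buffer.

```java
// Hypothetical illustration only; names are assumptions, not Hadoop APIs.
final class BoundsCheckSketch {

    // Rejects a read request whose target range [offs, offs + len)
    // does not fit inside the destination buffer.
    static void checkRange(byte[] dest, int offs, int len) {
        if (offs < 0) {
            throw new IndexOutOfBoundsException("offs(" + offs + ") < 0.");
        }
        if (len < 0) {
            throw new IndexOutOfBoundsException("len(" + len + ") < 0.");
        }
        // Widen to long so that offs + len cannot overflow an int.
        if ((long) offs + len > dest.length) {
            throw new IndexOutOfBoundsException(
                "offs(" + offs + ") + len(" + len
                    + ") > dest.length(" + dest.length + ").");
        }
    }

    public static void main(String[] args) {
        // Values copied from the reported exception message.
        byte[] dest = new byte[678046];
        try {
            checkRange(dest, 477658, 477659);
        } catch (IndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The sketch only shows why the reported combination of offset, length and buffer size is rejected; it says nothing about how the caller arrived at those values.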
Full stack trace as seen with 3.1.0:

{noformat}
2018-06-15 10:34:43,198 INFO mapreduce.Job - map 93% reduce 0%
2018-06-15 10:34:43,532 WARN mapred.LocalJobRunner - job_local543410164_0001
java.lang.Exception: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
Caused by: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
        at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
        at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:496)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.hadoop.io.WritableUtils.readString(WritableUtils.java:125)
        at org.apache.hadoop.io.WritableUtils.readStringArray(WritableUtils.java:169)
        at org.apache.nutch.protocol.ProtocolStatus.readFields(ProtocolStatus.java:177)
        at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:188)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:332)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2374)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2358)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:568)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)