From: "Xiangrui Meng (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Date: Sat, 17 May 2014 04:16:14 +0000 (UTC)
Subject: [jira] [Commented] (MAPREDUCE-5893) CBZip2InputStream is not threadsafe

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000652#comment-14000652 ]

Xiangrui Meng commented on MAPREDUCE-5893:
------------------------------------------

Checked the code in the trunk. This class has a static boolean member `skipDecompression`, which indicates whether it is decompressing or checking the next marker.
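To make the concern concrete, here is a minimal sketch of why a static mode flag is a problem once several streams live in one JVM. The class and method names below (`StaticFlagDemo`, `SharedFlagDecoder`, `PerInstanceFlagDecoder`, `startScanningForMarker`) are hypothetical stand-ins, not the Hadoop source; only the field name `skipDecompression` comes from the comment above. The second class shows one possible alternative, an instance field, purely as an assumption about what a fix could look like.

```java
// Hypothetical illustration only; this is not the CBZip2InputStream code.
class StaticFlagDemo {

    /** Decoder whose mode flag is static, so every instance in the JVM shares it. */
    static class SharedFlagDecoder {
        static boolean skipDecompression = false; // one copy per JVM

        void startScanningForMarker() { skipDecompression = true; }
        boolean isDecompressing() { return !skipDecompression; }
    }

    /** Same decoder with the flag kept per instance (assumed alternative). */
    static class PerInstanceFlagDecoder {
        private boolean skipDecompression = false; // one copy per stream

        void startScanningForMarker() { skipDecompression = true; }
        boolean isDecompressing() { return !skipDecompression; }
    }

    /** Runs the action on another thread and waits for it to finish. */
    static void runOnOtherThread(Runnable action) {
        Thread t = new Thread(action);
        t.start();
        try {
            t.join(); // join makes the other thread's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Is decoder A still decompressing after decoder B switched modes? (static flag) */
    static boolean sharedFlagSurvives() {
        SharedFlagDecoder.skipDecompression = false; // reset shared state
        SharedFlagDecoder a = new SharedFlagDecoder();
        SharedFlagDecoder b = new SharedFlagDecoder();
        runOnOtherThread(b::startScanningForMarker);
        return a.isDecompressing();
    }

    /** Same question with the flag held per instance. */
    static boolean instanceFlagSurvives() {
        PerInstanceFlagDecoder a = new PerInstanceFlagDecoder();
        PerInstanceFlagDecoder b = new PerInstanceFlagDecoder();
        runOnOtherThread(b::startScanningForMarker);
        return a.isDecompressing();
    }

    public static void main(String[] args) {
        System.out.println("static flag:   " + sharedFlagSurvives());   // prints "static flag:   false"
        System.out.println("instance flag: " + instanceFlagSurvives()); // prints "instance flag: true"
    }
}
```

With the static flag, a mode change made through stream B is observed by stream A, even though A never changed modes itself. In a real multithreaded reader the interleaving is not controlled as it is here, so the corruption would appear intermittently rather than on every run.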
> CBZip2InputStream is not threadsafe
> -----------------------------------
>
>                 Key: MAPREDUCE-5893
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5893
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv1, mrv2
>    Affects Versions: 1.2.1, 2.2.0
>            Reporter: Xiangrui Meng
>
> Hadoop uses CBZip2InputStream to decode bzip2 files. However, the implementation is not threadsafe. This is not really a problem for Hadoop MapReduce because Hadoop runs each task in a separate JVM. But for other libraries that utilize multithreading and use Hadoop's InputFormat, e.g., Spark, it will cause exceptions like the following:
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 6
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:729)
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:795)
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
> org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:428)
> java.io.InputStream.read(InputStream.java:101)
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
> org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
> org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
> org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:35)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1000)
> org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
> org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1077)
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1077)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:724)
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)