spark-issues mailing list archives

From "Josh Rosen (JIRA)" <>
Subject [jira] [Commented] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes
Date Mon, 19 Oct 2015 06:35:05 GMT


Josh Rosen commented on SPARK-11177:

It looks like this is caused by MAPREDUCE-4470, which is not patched in Apache Hadoop 1.x
releases. If Spark users cannot upgrade to Hadoop 2.x and absolutely need a fix for this,
then one somewhat hacky solution is to use a modified copy of CombineFileInputFormat which
lives in the Spark source tree and includes the three-line fix for MAPREDUCE-4470. While this
works (I have tests!), it's not an approach which is suitable for inclusion in a Spark release:
it's going to be borderline impossible to maintain source- and binary-compatibility with all
of our supported Hadoop versions while using this approach.
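
To illustrate the failure mode behind the stack trace below, here is a minimal, self-contained sketch. It is not Hadoop's code and not the actual MAPREDUCE-4470 patch: the class and method names (EmptyFileSplitSketch, Location, firstBlockHostsUnguarded, firstBlockHostsGuarded) are hypothetical, and it assumes that a zero-byte file reports an empty array of block locations, so that reading element 0 of that array raises the exception. The guarded variant shows the general shape such a small fix could take: substitute a single host-less placeholder location when the array is empty.

```java
import java.util.Arrays;

public class EmptyFileSplitSketch {
    // Hypothetical stand-in for a block location: only the hosts that hold the block.
    static final class Location {
        final String[] hosts;
        Location(String... hosts) { this.hosts = hosts; }
    }

    // Unguarded lookup: assumes every file has at least one block location.
    static String[] firstBlockHostsUnguarded(Location[] locations) {
        // Throws ArrayIndexOutOfBoundsException when the array is empty.
        return locations[0].hosts;
    }

    // Guarded lookup: an empty file gets one placeholder location with no hosts.
    static String[] firstBlockHostsGuarded(Location[] locations) {
        if (locations.length == 0) {
            locations = new Location[] { new Location() };
        }
        return locations[0].hosts;
    }

    public static void main(String[] args) {
        // Assumption: a zero-byte file is modelled as having no block locations.
        Location[] emptyFile = new Location[0];
        try {
            firstBlockHostsUnguarded(emptyFile);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded: " + e.getClass().getSimpleName());
        }
        System.out.println("guarded: " + Arrays.toString(firstBlockHostsGuarded(emptyFile)));
    }
}
```

In this toy model the unguarded lookup fails on the empty file while the guarded one returns an empty host list, so split computation could proceed. Whether the real patch takes exactly this form should be checked against the Hadoop source.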

> sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes
> -----------------------------------------------------------------------------------
>                 Key: SPARK-11177
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
> From a user report:
> {quote}
> When I upload a series of text files to an S3 directory and one of the files is empty
> (0 bytes), the sc.wholeTextFiles method throws a stack trace.
> java.lang.ArrayIndexOutOfBoundsException: 0
> at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(
> at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(
> at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(
> at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
> {quote}
> It looks like this has been a longstanding issue:
> *
> *
> *
