spark-issues mailing list archives

From "Dongjoon Hyun (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (SPARK-24610) wholeTextFiles broken for small files
Date Fri, 19 Oct 2018 21:47:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-24610:
----------------------------------
    Comment: was deleted

(was: User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/22725)

> wholeTextFiles broken for small files
> -------------------------------------
>
>                 Key: SPARK-24610
>                 URL: https://issues.apache.org/jira/browse/SPARK-24610
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.1, 2.3.1
>            Reporter: Dhruve Ashar
>            Assignee: Dhruve Ashar
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> Spark cannot read small files with the wholeTextFiles method when split-size-related configs are specified, either explicitly or through other config files such as hive-site.xml.
> For small files, the maxSplitSize computed by `WholeTextFileInputFormat` is far smaller than the default or commonly used split size of 64/128M, and Spark throws an exception while trying to read them.
>  
> To reproduce the issue: 
> {code:java}
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --conf "spark.hadoop.mapreduce.input.fileinputformat.split.minsize.per.node=123456"
> scala> sc.wholeTextFiles("file:///etc/passwd").count
> java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 9962
> at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
> at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
> at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
> ... 48 elided
> // For HDFS
> sc.wholeTextFiles("smallFile").count
> java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 15
> at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
> at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
> at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
> ... 48 elided{code}
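
The failure above comes down to one comparison: a configured per-node minimum split size is larger than a maximum split size derived from a very small input. The sketch below reproduces that arithmetic in plain Java with no Spark or Hadoop dependency. It is an illustration under stated assumptions, not the project's code: the class and method names are hypothetical, the maximum is assumed to be the total input length divided by the requested minimum number of partitions (rounded up), the 19923-byte length and the partition count of 2 are hypothetical values chosen to arrive at the reported 9962, and the clamping step is only one possible guard.

```java
import java.io.IOException;

// Hypothetical, dependency-free sketch of the split-size check behind the
// exception reported above. Names and inputs are illustrative assumptions.
class SplitSizeSketch {

    // Assumed derivation of the maximum split size: total input length
    // divided by the requested minimum number of partitions, rounded up.
    static long maxSplitSize(long totalLen, int minPartitions) {
        int parts = (minPartitions == 0) ? 1 : minPartitions;
        return (long) Math.ceil(totalLen * 1.0 / parts);
    }

    // The condition implied by the exception text: a non-zero per-node
    // minimum must not exceed a non-zero maximum.
    static void validate(long minSizeNode, long maxSize) throws IOException {
        if (minSizeNode != 0 && maxSize != 0 && minSizeNode > maxSize) {
            throw new IOException("Minimum split size pernode " + minSizeNode
                + " cannot be larger than maximum split size " + maxSize);
        }
    }

    // One possible guard (an assumption, not necessarily the committed fix):
    // lower the per-node minimum so it never exceeds the computed maximum.
    static long clampMinSize(long minSizeNode, long maxSize) {
        return Math.min(minSizeNode, maxSize);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file length and partition count.
        long max = maxSplitSize(19923L, 2);
        System.out.println("maxSplitSize = " + max);

        try {
            // 123456 is the value passed via --conf in the reproduction.
            validate(123456L, max);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // With the minimum clamped to the maximum, the same check passes.
        validate(clampMinSize(123456L, max), max);
        System.out.println("accepted after clamping");
    }
}
```

Because the maximum shrinks with the input, any fixed per-node minimum taken from a shared config such as hive-site.xml will fail this check once the files are small enough, which is consistent with both the local and the HDFS traces above.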



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

