hive-dev mailing list archives

From "Jimmy Xiang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8722) Enhance InputSplitShims to extend InputSplitWithLocationInfo [Spark Branch]
Date Fri, 19 Dec 2014 03:11:13 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252845#comment-14252845 ]

Jimmy Xiang commented on HIVE-8722:
-----------------------------------

It works without location info too. I was wondering if location info helps Spark work
more efficiently?

> Enhance InputSplitShims to extend InputSplitWithLocationInfo [Spark Branch]
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-8722
>                 URL: https://issues.apache.org/jira/browse/HIVE-8722
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Jimmy Xiang
>
> We got the following exception in hive.log:
> {noformat}
> 2014-11-03 11:45:49,865 DEBUG rdd.HadoopRDD
> (Logging.scala:logDebug(84)) - Failed to use InputSplitWithLocations.
> java.lang.ClassCastException: Cannot cast
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit
> to org.apache.hadoop.mapred.InputSplitWithLocationInfo
>         at java.lang.Class.cast(Class.java:3094)
>         at org.apache.spark.rdd.HadoopRDD.getPreferredLocations(HadoopRDD.scala:278)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:215)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1303)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1313)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1312)
> {noformat}
> My understanding is that the split location info helps Spark execute tasks more efficiently.
> This could help other execution engines too, so we should consider enhancing InputSplitShim
> to implement InputSplitWithLocationInfo if possible.
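The proposed enhancement could look roughly like the sketch below. To keep it self-contained, it uses minimal stand-ins that mirror the shape of Hadoop's InputSplitWithLocationInfo and SplitLocationInfo rather than the real classes, and the shim class name and hosts are hypothetical; the point is just that a split exposing getLocationInfo() lets a scheduler like Spark's HadoopRDD.getPreferredLocations() prefer nodes holding the split's blocks instead of hitting the ClassCastException above.

```java
// Hedged sketch only: SplitLocationInfo / InputSplitWithLocationInfo below are
// simplified stand-ins for the Hadoop classes of the same names, and
// CombineSplitShim is a hypothetical shim, not Hive's actual CombineHiveInputSplit.
public class SplitShimSketch {

    // Stand-in mirroring org.apache.hadoop.mapred.SplitLocationInfo
    static class SplitLocationInfo {
        private final String location;
        private final boolean inMemory;
        SplitLocationInfo(String location, boolean inMemory) {
            this.location = location;
            this.inMemory = inMemory;
        }
        String getLocation() { return location; }
        boolean isInMemory() { return inMemory; }
    }

    // Stand-in mirroring org.apache.hadoop.mapred.InputSplitWithLocationInfo
    interface InputSplitWithLocationInfo {
        SplitLocationInfo[] getLocationInfo();
    }

    // Hypothetical shim split: wraps the hosts that store the split's data and
    // exposes them via getLocationInfo(), so the scheduler's cast succeeds and
    // it can schedule the task on (or near) one of those hosts.
    static class CombineSplitShim implements InputSplitWithLocationInfo {
        private final String[] hosts;
        CombineSplitShim(String[] hosts) { this.hosts = hosts; }
        public SplitLocationInfo[] getLocationInfo() {
            SplitLocationInfo[] info = new SplitLocationInfo[hosts.length];
            for (int i = 0; i < hosts.length; i++) {
                // On-disk HDFS blocks, hence inMemory = false.
                info[i] = new SplitLocationInfo(hosts[i], false);
            }
            return info;
        }
    }

    public static void main(String[] args) {
        CombineSplitShim split =
            new CombineSplitShim(new String[]{"node1.example", "node2.example"});
        for (SplitLocationInfo li : split.getLocationInfo()) {
            System.out.println(li.getLocation() + " inMemory=" + li.isInMemory());
        }
    }
}
```

Without such an interface, a scheduler that only knows the generic split type has no portable way to ask "where does this data live?", which is why the cast in HadoopRDD.getPreferredLocations fails today and locality falls back to arbitrary placement.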



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
