hive-dev mailing list archives

From "Rui Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8722) Enhance InputSplitShims to extend InputSplitWithLocationInfo [Spark Branch]
Date Thu, 18 Dec 2014 13:25:13 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251643#comment-14251643
] 

Rui Li commented on HIVE-8722:
------------------------------

Never mind my last comments. That's because I used hadoop-2.4, which doesn't have that class.

> Enhance InputSplitShims to extend InputSplitWithLocationInfo [Spark Branch]
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-8722
>                 URL: https://issues.apache.org/jira/browse/HIVE-8722
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Jimmy Xiang
>
> We got the following exception in hive.log:
> {noformat}
> 2014-11-03 11:45:49,865 DEBUG rdd.HadoopRDD
> (Logging.scala:logDebug(84)) - Failed to use InputSplitWithLocations.
> java.lang.ClassCastException: Cannot cast
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit
> to org.apache.hadoop.mapred.InputSplitWithLocationInfo
>         at java.lang.Class.cast(Class.java:3094)
>         at org.apache.spark.rdd.HadoopRDD.getPreferredLocations(HadoopRDD.scala:278)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:215)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1303)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1313)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1312)
> {noformat}
> My understanding is that the split location info helps Spark execute tasks more efficiently.
> It could help other execution engines too, so we should consider enhancing InputSplitShim
> to implement InputSplitWithLocationInfo if possible.
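A minimal sketch of the idea. The stand-in types below are simplified stand-ins for Hadoop's `org.apache.hadoop.mapred.InputSplitWithLocationInfo` and `SplitLocationInfo` (available in Hadoop 2.5+, which is why hadoop-2.4 lacks the class); a real shim would implement the actual Hadoop interfaces, and the shim class name here is hypothetical:

```java
// Stand-in for org.apache.hadoop.mapred.SplitLocationInfo:
// a (host, in-memory?) pair describing where split data lives.
class SplitLocationInfo {
    private final String location;
    private final boolean inMemory;

    SplitLocationInfo(String location, boolean inMemory) {
        this.location = location;
        this.inMemory = inMemory;
    }

    String getLocation() { return location; }
    boolean isInMemory() { return inMemory; }
}

// Stand-in for org.apache.hadoop.mapred.InputSplitWithLocationInfo.
// Spark's HadoopRDD.getPreferredLocations casts splits to this type
// to obtain locality hints, which is where the ClassCastException above
// comes from when the split does not implement it.
interface InputSplitWithLocationInfo {
    SplitLocationInfo[] getLocationInfo();
}

// Hypothetical combine-split shim exposing its preferred hosts
// so the scheduler can place tasks near the data.
class CombineSplitShim implements InputSplitWithLocationInfo {
    private final String[] hosts;

    CombineSplitShim(String[] hosts) { this.hosts = hosts; }

    @Override
    public SplitLocationInfo[] getLocationInfo() {
        SplitLocationInfo[] info = new SplitLocationInfo[hosts.length];
        for (int i = 0; i < hosts.length; i++) {
            // false: the replicas are on disk, not cached in memory.
            info[i] = new SplitLocationInfo(hosts[i], false);
        }
        return info;
    }
}
```

With this in place, the cast in HadoopRDD would succeed and the scheduler could use the returned hosts as locality hints instead of falling back to no preference.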



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
