hive-dev mailing list archives

From "Rui Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8722) Enhance InputSplitShims to extend InputSplitWithLocationInfo [Spark Branch]
Date Fri, 19 Dec 2014 03:31:13 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252859#comment-14252859
] 

Rui Li commented on HIVE-8722:
------------------------------

Hi [~jxiang], yes, I think data locality can have a dramatic impact on performance. I saw nearly
a 2.5X difference in previous work (SPARK-1937).
But I don't think we have to make {{CombineHiveInputSplit}} implement {{InputSplitWithLocationInfo}}
to get location info; {{CombineHiveInputSplit.getLocations}} already gives us what we need. {{InputSplitWithLocationInfo}}
is only an enhancement that makes cached replicas appear first in the location list (SPARK-1767).
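
To illustrate the point, here is a rough sketch of how a caller could handle both kinds of split. It is not Spark's or Hive's actual code. The interfaces are simplified stand-ins for the Hadoop {{org.apache.hadoop.mapred}} types (the real methods declare {{IOException}}), and the class name {{PreferredLocationsSketch}} and the method {{preferredHosts}} are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PreferredLocationsSketch {

    // Simplified stand-in for org.apache.hadoop.mapred.InputSplit (assumption:
    // only the location accessor is modelled, without IOException).
    interface InputSplit {
        String[] getLocations();
    }

    // Simplified stand-in for Hadoop's SplitLocationInfo: a host plus a flag
    // saying whether the replica on that host is cached in memory.
    static final class SplitLocationInfo {
        final String host;
        final boolean inMemory;

        SplitLocationInfo(String host, boolean inMemory) {
            this.host = host;
            this.inMemory = inMemory;
        }
    }

    // Simplified stand-in for org.apache.hadoop.mapred.InputSplitWithLocationInfo.
    interface InputSplitWithLocationInfo extends InputSplit {
        SplitLocationInfo[] getLocationInfo();
    }

    // Hypothetical helper: returns hosts for a split. If the split exposes
    // location info, hosts with cached replicas come first; otherwise the
    // plain getLocations() result is used unchanged.
    static List<String> preferredHosts(InputSplit split) {
        if (split instanceof InputSplitWithLocationInfo) {
            SplitLocationInfo[] infos =
                ((InputSplitWithLocationInfo) split).getLocationInfo();
            if (infos != null) {
                List<String> cached = new ArrayList<>();
                List<String> onDisk = new ArrayList<>();
                for (SplitLocationInfo info : infos) {
                    (info.inMemory ? cached : onDisk).add(info.host);
                }
                cached.addAll(onDisk);
                return cached;
            }
        }
        // Fallback: any split still reports its hosts through getLocations().
        return new ArrayList<>(Arrays.asList(split.getLocations()));
    }

    public static void main(String[] args) {
        // A plain split, analogous to one that only implements getLocations().
        InputSplit plain = () -> new String[] {"node1", "node2"};
        System.out.println(preferredHosts(plain));
    }
}
```

With the fallback branch, a split that only implements {{getLocations}} still yields usable hosts; the location-info branch changes only the ordering.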

> Enhance InputSplitShims to extend InputSplitWithLocationInfo [Spark Branch]
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-8722
>                 URL: https://issues.apache.org/jira/browse/HIVE-8722
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Jimmy Xiang
>
> We got the following exception in hive.log:
> {noformat}
> 2014-11-03 11:45:49,865 DEBUG rdd.HadoopRDD
> (Logging.scala:logDebug(84)) - Failed to use InputSplitWithLocations.
> java.lang.ClassCastException: Cannot cast
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit
> to org.apache.hadoop.mapred.InputSplitWithLocationInfo
>         at java.lang.Class.cast(Class.java:3094)
>         at org.apache.spark.rdd.HadoopRDD.getPreferredLocations(HadoopRDD.scala:278)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:215)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1303)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1313)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1312)
> {noformat}
> My understanding is that the split location info helps Spark execute tasks more efficiently.
> This could help other execution engines too, so we should consider enhancing InputSplitShim
> to implement InputSplitWithLocationInfo if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
