hive-dev mailing list archives

From "Rui Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8722) Enhance InputSplitShims to extend InputSplitWithLocationInfo [Spark Branch]
Date Fri, 19 Dec 2014 02:22:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252804#comment-14252804 ]

Rui Li commented on HIVE-8722:
------------------------------

I think Spark doesn't require the input split to be an {{InputSplitWithLocationInfo}}:
{code}
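    // Probe (via reflection) for the InputSplitWithLocationInfo API,
    // which only exists in newer Hadoop versions.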
    val locs: Option[Seq[String]] = HadoopRDD.SPLIT_INFO_REFLECTIONS match {
      case Some(c) =>
        try {
          val lsplit = c.inputSplitWithLocationInfo.cast(hsplit)
          val infos = c.getLocationInfo.invoke(lsplit).asInstanceOf[Array[AnyRef]]
          Some(HadoopRDD.convertSplitLocationInfo(infos))
        } catch {
          case e: Exception =>
            logDebug("Failed to use InputSplitWithLocations.", e)
            None
        }
      case None => None
    }
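    // If the reflective path failed, fall back to the plain getLocations()
    // API, dropping the unhelpful "localhost" placeholder entries.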
    locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
{code}
If it fails to use {{InputSplitWithLocationInfo}}, Spark falls back to calling the plain {{getLocations}} method.
And {{CombineHiveInputSplit}} implements {{getLocations}} by calling {{CombineFileSplit.getLocations}}, as sketched below.
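A minimal sketch of that delegation, assuming the split wraps a {{CombineFileSplit}}-based shim in a field named {{inputSplitShim}} (the field name is illustrative, not quoted from the Hive source):
{code}
// Hypothetical sketch of CombineHiveInputSplit's location lookup: it simply
// forwards to the wrapped CombineFileSplit-derived shim, so Spark's
// getOrElse fallback above still receives usable host names.
public String[] getLocations() throws IOException {
  return inputSplitShim.getLocations();
}
{code}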

> Enhance InputSplitShims to extend InputSplitWithLocationInfo [Spark Branch]
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-8722
>                 URL: https://issues.apache.org/jira/browse/HIVE-8722
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Jimmy Xiang
>
> We got the following exception in hive.log:
> {noformat}
> 2014-11-03 11:45:49,865 DEBUG rdd.HadoopRDD
> (Logging.scala:logDebug(84)) - Failed to use InputSplitWithLocations.
> java.lang.ClassCastException: Cannot cast
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit
> to org.apache.hadoop.mapred.InputSplitWithLocationInfo
>         at java.lang.Class.cast(Class.java:3094)
>         at org.apache.spark.rdd.HadoopRDD.getPreferredLocations(HadoopRDD.scala:278)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:215)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1303)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1313)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1312)
> {noformat}
> My understanding is that the split location info helps Spark execute tasks more efficiently.
> This could help other execution engines too. So we should consider enhancing InputSplitShim
> to implement InputSplitWithLocationInfo if possible.
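For illustration, a minimal sketch of what such an enhancement could look like, assuming the shim keeps extending {{CombineFileSplit}} and tracks no in-memory replica information (so every location is reported as on-disk). This is a hypothetical sketch, not the actual patch:
{code}
import java.io.IOException;

import org.apache.hadoop.mapred.InputSplitWithLocationInfo;
import org.apache.hadoop.mapred.SplitLocationInfo;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Hypothetical sketch, not the committed patch: expose the existing host
// list through the newer Hadoop interface so Spark's reflective cast
// succeeds and the ClassCastException above never occurs.
public class InputSplitShim extends CombineFileSplit
    implements InputSplitWithLocationInfo {

  @Override
  public SplitLocationInfo[] getLocationInfo() throws IOException {
    String[] hosts = getLocations();
    SplitLocationInfo[] info = new SplitLocationInfo[hosts.length];
    for (int i = 0; i < hosts.length; i++) {
      // Assumption: no in-memory replica data is available here, so every
      // location is reported as on-disk (inMemory = false).
      info[i] = new SplitLocationInfo(hosts[i], false);
    }
    return info;
  }
}
{code}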



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
