hive-dev mailing list archives

From "Brock Noland (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
Date Wed, 17 Dec 2014 19:33:13 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250373#comment-14250373
] 

Brock Noland commented on HIVE-9127:
------------------------------------

bq. In looking into HIVE-9135, I was wondering if it is better to fix the root cause of HIVE-7431
instead of disabling the cache for Spark.

I think that would be awesome. I think we disabled it early on when we were just trying to
get Hive on Spark (HOS) working.

bq. If so, probably we don't need this work around?

I think this "workaround" results in better code generally. CombineHiveInputFormat used to
look up the partition information on every loop iteration; with this fix we look it up
once, before the loop.
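
For readers who have not opened the patch, here is a minimal sketch of the general idea of moving a lookup out of a loop. It is not the actual CombineHiveInputFormat code: the class name, the method names, the paths, and the lookup counter are all hypothetical and exist only to make the difference visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Hive.
class HoistLookupSketch {
    // Counts how often the stand-in "expensive" lookup runs.
    static int lookups = 0;

    // Stand-in for fetching the path-to-partition information (assumed to be costly).
    static Map<String, String> loadPartitionInfo() {
        lookups++;
        Map<String, String> info = new HashMap<>();
        info.put("/warehouse/t/ds=1", "ds=1");
        info.put("/warehouse/t/ds=2", "ds=2");
        return info;
    }

    // Before: the lookup is repeated for every path in the loop.
    static List<String> resolvePerIteration(List<String> paths) {
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            out.add(loadPartitionInfo().get(p));
        }
        return out;
    }

    // After: the lookup happens once, before the loop, and the result is reused.
    static List<String> resolveHoisted(List<String> paths) {
        Map<String, String> info = loadPartitionInfo();
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            out.add(info.get(p));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/warehouse/t/ds=1", "/warehouse/t/ds=2");

        lookups = 0;
        resolvePerIteration(paths);
        System.out.println("per-iteration lookups: " + lookups); // prints 2

        lookups = 0;
        resolveHoisted(paths);
        System.out.println("hoisted lookups: " + lookups); // prints 1
    }
}
```

Both methods return the same result; only the number of lookups differs, and that gap grows with the number of input paths.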

> Improve CombineHiveInputFormat.getSplit performance
> ---------------------------------------------------
>
>                 Key: HIVE-9127
>                 URL: https://issues.apache.org/jira/browse/HIVE-9127
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: 0.14.0
>            Reporter: Brock Noland
>            Assignee: Brock Noland
>         Attachments: HIVE-9127.1-spark.patch.txt, HIVE-9127.2-spark.patch.txt, HIVE-9127.3.patch.txt
>
>
> In HIVE-7431 we disabled caching of Map/Reduce works because some tasks would fail. However,
we should be able to cache these objects in RSC for split generation. For how this impacts performance, see:
https://issues.apache.org/jira/browse/HIVE-9124?focusedCommentId=14248622&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14248622
> Caller ST:
> {noformat}
> ....
> 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:328)
> 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:421)
> 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:510)
> 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
> 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at scala.Option.getOrElse(Option.scala:120)
> 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at scala.Option.getOrElse(Option.scala:120)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:192)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at scala.Option.getOrElse(Option.scala:120)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.rdd.RDD.dependencies(RDD.scala:190)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:301)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:313)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:247)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:735)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1382)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2014-12-16 14:36:22,204 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435))
-        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
