Return-Path: 
X-Original-To: apmail-hive-dev-archive@www.apache.org
Delivered-To: apmail-hive-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id E54D710033
    for ; Wed, 17 Dec 2014 19:33:14 +0000 (UTC)
Received: (qmail 28135 invoked by uid 500); 17 Dec 2014 19:33:13 -0000
Delivered-To: apmail-hive-dev-archive@hive.apache.org
Received: (qmail 28074 invoked by uid 500); 17 Dec 2014 19:33:13 -0000
Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
List-Help: 
List-Unsubscribe: 
List-Post: 
List-Id: 
Reply-To: dev@hive.apache.org
Delivered-To: mailing list dev@hive.apache.org
Received: (qmail 28061 invoked by uid 500); 17 Dec 2014 19:33:13 -0000
Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org
Received: (qmail 28058 invoked by uid 99); 17 Dec 2014 19:33:13 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Dec 2014 19:33:13 +0000
Date: Wed, 17 Dec 2014 19:33:13 +0000 (UTC)
From: "Brock Noland (JIRA)" 
To: hive-dev@hadoop.apache.org
Message-ID: 
In-Reply-To: 
References: 
Subject: [jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250373#comment-14250373 ]

Brock Noland commented on HIVE-9127:
------------------------------------

bq. In looking into HIVE-9135, I was wondering if it is better to fix the root cause of HIVE-7431 instead disabling the cache for Spark.

I think that would be awesome. I think we disabled it early on when we were just trying to get HOS working.

bq. If so, probably we don't need this work around?

I think this "work around" results in better code generally.
In CombineHiveInputFormat we were looking up the partition information on each loop iteration, but with this fix we do it once before the loop, which is generally better.

> Improve CombineHiveInputFormat.getSplit performance
> ---------------------------------------------------
>
>                 Key: HIVE-9127
>                 URL: https://issues.apache.org/jira/browse/HIVE-9127
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: 0.14.0
>            Reporter: Brock Noland
>            Assignee: Brock Noland
>         Attachments: HIVE-9127.1-spark.patch.txt, HIVE-9127.2-spark.patch.txt, HIVE-9127.3.patch.txt
>
>
> In HIVE-7431 we disabled caching of Map/Reduce works because some tasks would fail. However, we should be able to cache these objects in RSC for split generation. See https://issues.apache.org/jira/browse/HIVE-9124?focusedCommentId=14248622&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14248622 for how this impacts performance.
> Caller ST:
> {noformat}
> ....
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:328)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:421)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:510)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.ShuffleDependency.(Dependency.scala:79)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:192)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.dependencies(RDD.scala:190)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:301)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:313)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:247)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:735)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1382)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2014-12-16 14:36:22,204 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
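The comment above describes hoisting the per-iteration partition lookup out of the split loop. A minimal, self-contained sketch of that pattern follows; the class and method names here are hypothetical illustrations, not Hive's actual CombineHiveInputFormat code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/**
 * Illustration of hoisting an invariant lookup out of a loop.
 * PartitionDesc and lookupPartition() are stand-ins for Hive's
 * per-partition metadata and its path-to-partition resolution.
 */
public class SplitPlanner {

    /** Hypothetical stand-in for per-partition metadata. */
    static class PartitionDesc {
        final String inputFormat;
        PartitionDesc(String inputFormat) { this.inputFormat = inputFormat; }
    }

    /** Hypothetical expensive lookup performed per path. */
    static PartitionDesc lookupPartition(Map<String, PartitionDesc> meta, String path) {
        return meta.get(path);
    }

    /** Before: the lookup runs once per split, inside the loop. */
    static List<String> formatsPerSplitSlow(List<String> splitPaths,
                                            Map<String, PartitionDesc> meta) {
        List<String> out = new ArrayList<>();
        for (String path : splitPaths) {
            PartitionDesc desc = lookupPartition(meta, path); // repeated work
            out.add(desc.inputFormat);
        }
        return out;
    }

    /** After: resolve each distinct path once, before the loop. */
    static List<String> formatsPerSplitFast(List<String> splitPaths,
                                            Map<String, PartitionDesc> meta) {
        Map<String, PartitionDesc> cache = new HashMap<>();
        for (String path : new HashSet<>(splitPaths)) {
            cache.put(path, lookupPartition(meta, path)); // once per distinct path
        }
        List<String> out = new ArrayList<>();
        for (String path : splitPaths) {
            out.add(cache.get(path).inputFormat); // cheap map hit in the hot loop
        }
        return out;
    }
}
```

With many splits sharing a few partitions, the second version performs one lookup per distinct partition path rather than one per split, which is the shape of the improvement the comment describes.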