hive-dev mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-7958) SparkWork generated by SparkCompiler may require multiple Spark jobs to run
Date Wed, 03 Sep 2014 15:55:52 GMT
Xuefu Zhang created HIVE-7958:
---------------------------------

             Summary: SparkWork generated by SparkCompiler may require multiple Spark jobs to run
                 Key: HIVE-7958
                 URL: https://issues.apache.org/jira/browse/HIVE-7958
             Project: Hive
          Issue Type: Bug
          Components: Spark
            Reporter: Xuefu Zhang
            Priority: Critical


A SparkWork instance may currently contain disjoint (disconnected) work graphs. For instance, union_remove_1.q
may generate a plan like this:
{code}
Reduce 2 <- Map 1
Reduce 4 <- Map 3
{code}
The SparkPlan instance generated from this work graph contains two result RDDs. When such
a plan is executed, we call .foreach() on the two RDDs sequentially, which results in two Spark
jobs, one after the other.
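
For illustration only (not Hive code): a minimal standalone Java Spark sketch of the situation above,
where each of two disjoint result RDDs is driven by its own foreach() action, so the second job cannot
be submitted until the first finishes. All class and variable names here are made up for the example:
{code}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TwoJobsSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("two-jobs-sketch").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Two disjoint lineages, analogous to "Reduce 2 <- Map 1" and "Reduce 4 <- Map 3".
    JavaRDD<Integer> resultA = sc.parallelize(Arrays.asList(1, 2, 3)).map(x -> x * 2);
    JavaRDD<Integer> resultB = sc.parallelize(Arrays.asList(4, 5, 6)).map(x -> x + 1);

    // Each foreach() is an action, so each call submits its own Spark job;
    // the second job does not start until the first one completes.
    resultA.foreach(x -> { /* side effect, e.g. write the row out */ });
    resultB.foreach(x -> { /* side effect, e.g. write the row out */ });

    sc.stop();
  }
}
{code}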

While this works functionally, the performance will not be great as the Spark jobs are run
sequentially rather than concurrently.

Another side effect of this is that the corresponding SparkPlan instance is over-complicated.

There are two potential approaches:

1. Let SparkCompiler generate only work that can be executed in ONE Spark job. In the above
example, two Spark tasks should be generated.

2. Let SparkPlanGenerator generate multiple Spark plans and then have SparkClient execute them
concurrently.

Approach #1 seems more reasonable and fits naturally with our architecture. Also, Hive's task
execution framework already takes care of task concurrency. A rough sketch of the graph-splitting
idea behind approach #1 follows.
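
The sketch below is a self-contained illustration of the connected-components split that approach #1
implies: one sub-graph per component, so each component can become its own Spark task. It operates on a
generic adjacency map rather than Hive's actual SparkWork/BaseWork classes, and all names are hypothetical:
{code}
import java.util.*;

/** Hypothetical sketch: split a possibly disconnected work graph into
 *  one sub-graph per connected component. Not Hive code. */
public class WorkGraphSplitter {

  /** graph: undirected adjacency, e.g. {"Map 1": ["Reduce 2"], "Reduce 2": ["Map 1"], ...} */
  public static List<Set<String>> split(Map<String, List<String>> graph) {
    List<Set<String>> components = new ArrayList<>();
    Set<String> visited = new HashSet<>();
    for (String start : graph.keySet()) {
      if (visited.contains(start)) continue;
      // BFS to collect every work node reachable from 'start'.
      Set<String> component = new LinkedHashSet<>();
      Deque<String> queue = new ArrayDeque<>();
      queue.add(start);
      visited.add(start);
      while (!queue.isEmpty()) {
        String node = queue.poll();
        component.add(node);
        for (String next : graph.getOrDefault(node, Collections.emptyList())) {
          if (visited.add(next)) {
            queue.add(next);
          }
        }
      }
      components.add(component);
    }
    return components;
  }

  public static void main(String[] args) {
    // The union_remove_1.q shape from the description: two disjoint graphs.
    Map<String, List<String>> graph = new LinkedHashMap<>();
    graph.put("Map 1", Arrays.asList("Reduce 2"));
    graph.put("Reduce 2", Arrays.asList("Map 1"));
    graph.put("Map 3", Arrays.asList("Reduce 4"));
    graph.put("Reduce 4", Arrays.asList("Map 3"));

    // Expect two components -> two Spark tasks.
    System.out.println(split(graph));  // [[Map 1, Reduce 2], [Map 3, Reduce 4]]
  }
}
{code}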



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
