hive-dev mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-7503) Support Hive's multi-table insert query with Spark
Date Tue, 05 Aug 2014 21:17:12 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082045#comment-14082045 ]

Xuefu Zhang edited comment on HIVE-7503 at 8/5/14 9:15 PM:
-----------------------------------------------------------

As it's unlikely that SPARK-2688 will land in the short term, and since HIVE-7525 has shown
that we can submit Spark jobs concurrently, I'd like to propose the following backup plan:

1. The multi-insert plan can be decomposed into 1 + N plans, where N is the number of inserts.
2. The "1" plan is the one that generates the data source for all the inserts. This plan may
not be necessary if the source is a table, but in general the source comes from a job.
3. For each insert, there will be one plan that inserts the data. Thus, N inserts correspond
to N plans. The input to these plans is the data generated by the "1" job.
4. We first run the "1" plan as a Spark job, which emits the data source.
5. Then, we call checkpoint() on the RDD from #4.
6. Lastly, we launch N jobs concurrently, each with the above checkpointed RDD as input.

While not ideal and probably not the most efficient, this approach should perform better than
running the 1 + N jobs sequentially (see the sketch below).
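
Here is a minimal sketch of steps 4-6 in plain Spark (Scala); the SparkContext setup, the
paths, and the saveAsTextFile calls standing in for the N insert plans are hypothetical
placeholders, not the actual Hive-on-Spark code:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object MultiInsertSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-insert-sketch"))
    sc.setCheckpointDir("/tmp/multi-insert-ckpt") // hypothetical checkpoint dir

    // Step 4: run the "1" plan that produces the shared data source.
    val source: RDD[String] = sc.textFile("/tmp/source-input") // placeholder source

    // Step 5: checkpoint the RDD. checkpoint() is lazy, so an action is
    // needed to actually materialize the checkpointed data.
    source.checkpoint()
    source.count()

    // Step 6: submit the N insert jobs concurrently from separate threads;
    // concurrent job submission on one SparkContext is what HIVE-7525 verified.
    implicit val ec: ExecutionContext = ExecutionContext.global
    val targets = Seq("/tmp/insert-1", "/tmp/insert-2") // placeholders for the N sinks
    val jobs = targets.map(dir => Future { source.saveAsTextFile(dir) })
    jobs.foreach(Await.ready(_, Duration.Inf))

    sc.stop()
  }
}
{code}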

The idea came from [~sowen] and was verified by [~csun] via HIVE-7525.

[~rxin], what are your thoughts on this approach? Do you have any other suggestions?



was (Author: xuefuz):
As it's unlikely that SPARK-2688 will land in the short term, and since HIVE-7525 has shown
that we can submit Spark jobs concurrently, I'd like to propose the following backup plan:

1. The multi-insert plan can be decomposed into 1 + N plans, where N is the number of inserts.
2. The "1" plan is the one that generates the data source for all the inserts. This plan may
not be necessary if the source is a table, but in general the source comes from a job.
3. For each insert, there will be one plan that inserts the data. Thus, N inserts correspond
to N plans. The input to these plans is the data generated by the "1" job.
4. We first run the "1" plan as a Spark job, which emits the data source.
5. Then, we cache the data source via RDD.cache().
6. Lastly, we launch N jobs concurrently, each with the above cached RDD as input.

While not ideal and probably not the most efficient, this approach should perform better than
running the 1 + N jobs sequentially.

The idea came from [~sowen] and was verified by [~csun] via HIVE-7525.

[~rxin], what are your thoughts on this approach? Do you have any other suggestions?


> Support Hive's multi-table insert query with Spark
> --------------------------------------------------
>
>                 Key: HIVE-7503
>                 URL: https://issues.apache.org/jira/browse/HIVE-7503
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Chao
>
> For Hive's multi-insert query (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML),
> there may be an MR job for each insert. When we achieve this with Spark, it would be nice
> if all the inserts could happen concurrently.
> It seems that this functionality isn't available in Spark. To make things worse, the
> source of the insert may be re-computed unless it's staged. Even then, the inserts will
> happen sequentially, hurting performance.
> This task is to find out what it takes in Spark to enable this without requiring staging
> of the source or sequential insertion. If this has to be solved in Hive, find an optimal
> way to do it.
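
To illustrate why the source may be re-computed unless it's staged, here is a minimal sketch
in plain Spark (Scala); the input/output paths and the expensiveTransform helper are
hypothetical stand-ins, not Hive code:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object RecomputeSketch {
  // Hypothetical stand-in for the expensive stage feeding the inserts.
  def expensiveTransform(line: String): String = line.toUpperCase

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("recompute-sketch"))

    // Shared source of the multi-insert, derived from an expensive stage.
    val source = sc.textFile("/tmp/input").map(expensiveTransform)

    // Without source.cache() or source.checkpoint(), each save below replays
    // the full lineage, re-running expensiveTransform over all of the input --
    // and the two saves run sequentially, one Spark job after the other.
    source.saveAsTextFile("/tmp/table1")
    source.saveAsTextFile("/tmp/table2")

    sc.stop()
  }
}
{code}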



--
This message was sent by Atlassian JIRA
(v6.2#6252)
