hive-dev mailing list archives

From "Xuefu Zhang" <xzh...@cloudera.com>
Subject Re: Review Request 25394: HIVE-7503: Support Hive's multi-table insert query with Spark [Spark Branch]
Date Fri, 05 Sep 2014 18:18:55 GMT


> On Sept. 5, 2014, 5:59 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 228
> > <https://reviews.apache.org/r/25394/diff/1/?file=680525#file680525line228>
> >
> >     What's the reason to remove this?
> 
> Chao Sun wrote:
>     This is an issue we encountered in HIVE-7870: with this line, context.fileSinkSet
>     will contain multiple duplicated fileSinks, which may then generate duplicated Move/Merge
>     tasks. It would be better to leave it to be solved in that JIRA. However, commenting out
>     this line makes things easier, since in {{GenSparkUtils::processFileSink}} I don't need to
>     consider those "fake" file sinks - they should not be in the {{opToTaskTable}}.
>     
>     I can also keep this line and change some other places. It's not a big issue.

Probably you can put comments or TODOs on this. Thanks for the explanation.


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25394/#review52472
-----------------------------------------------------------


On Sept. 5, 2014, 6:18 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25394/
> -----------------------------------------------------------
> 
> (Updated Sept. 5, 2014, 6:18 p.m.)
> 
> 
> Review request for hive, Brock Noland and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7503
>     https://issues.apache.org/jira/browse/HIVE-7503
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> For Hive's multi-insert query (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML),
> there may be an MR job for each insert. When we achieve this with Spark, it would be nice
> if all the inserts could happen concurrently.
> It seems that this functionality isn't available in Spark. To make things worse, the
> source of the insert may be re-computed unless it's staged. Even with staging, the inserts
> will happen sequentially, making the performance suffer.
> This task is to find out what it takes in Spark to enable this without requiring staging
> the source and sequential insertion. If this has to be solved in Hive, find an optimal
> way to do this.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 9c808d4 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 5ddc16d 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 379a39c 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 864965e 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 76fc290 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 5fcaf64 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25394/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

