Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Tue, 9 May 2017 04:00:06 +0000 (UTC)
From: "Rui Li (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13069591.1494019575000.159827.1494302406655@Atlassian.JIRA>
In-Reply-To: <JIRA.13069591.1494019575000@Atlassian.JIRA>
References: <JIRA.13069591.1494019575000@Atlassian.JIRA> <JIRA.13069591.1494019575163@jira-lw-us.apache.org>
Subject: [jira] [Commented] (HIVE-16600) Refactor
 SetSparkReducerParallelism#needSetParallelism to enable parallel order by
 in multi_insert cases
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Tue, 09 May 2017 04:00:11 -0000


    [ https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001991#comment-16001991 ] 

Rui Li commented on HIVE-16600:
-------------------------------

Thanks [~kellyzly] for working on this. I don't think the test failures are related.
I remember you mentioned that something went wrong when you first enabled parallel order by for multi insert. Have you figured out what the cause was?
Besides, please add a qtest for this. I think it should cover simple multi insert as well as multi insert + limit.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.1.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
>     SELECT key, value
> INSERT OVERWRITE TABLE e2
>     SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> The parallelism of Sort is 1 even though parallel order by is enabled ("hive.optimize.sampling.orderby" is set to "true"). This is not reasonable, because the parallelism should be calculated by [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170].
> This is because SetSparkReducerParallelism#needSetParallelism returns false when the [children size of RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207] is greater than 1.
> In this case, the children size of {{RS[2]}} is two.
> The logical plan of the case:
> {code}
>    TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                             -SEL[6]-FS[7]
> {code}
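
To make the behaviour described above concrete, here is a minimal toy sketch in Java. It is not the Hive source: the Op and Kind types and the names singlePathRule, allBranchesRule and multiInsertPlan are hypothetical stand-ins invented for illustration, and the real check also looks at configuration and reducer counts that are left out here. singlePathRule mimics the rule as the description states it (give up once the plan fans out below the RS), while allBranchesRule is only one possible relaxed rule, assumed for the sake of the example and not taken from the attached patch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only: hypothetical stand-ins, not Hive's operator classes.
class NeedSetParallelismSketch {

    enum Kind { TS, SEL, RS, FS, LIM }

    static final class Op {
        final Kind kind;
        final List<Op> children = new ArrayList<>();

        Op(Kind kind) { this.kind = kind; }

        // Attach a child and return that child, so chains can be built inline.
        Op add(Op child) {
            children.add(child);
            return child;
        }
    }

    // Rule as described in the issue: walk down from the RS and return false
    // as soon as an operator has more than one child or a limit is found.
    static boolean singlePathRule(Op rs) {
        List<Op> children = rs.children;
        while (!children.isEmpty()) {
            if (children.size() != 1 || children.get(0).kind == Kind.LIM) {
                return false;
            }
            Op only = children.get(0);
            if (only.kind == Kind.RS || only.kind == Kind.FS) {
                break; // reached the end of this stage
            }
            children = only.children;
        }
        return true;
    }

    // Assumed relaxed rule: visit every branch below the RS and return false
    // only if a limit appears before a branch reaches an RS or FS.
    static boolean allBranchesRule(Op rs) {
        Deque<Op> pending = new ArrayDeque<>(rs.children);
        while (!pending.isEmpty()) {
            Op op = pending.pop();
            if (op.kind == Kind.LIM) {
                return false;
            }
            if (op.kind == Kind.RS || op.kind == Kind.FS) {
                continue; // this branch ends here
            }
            pending.addAll(op.children);
        }
        return true;
    }

    // Builds the part of the plan below RS[2]; the optional limit is placed
    // directly under the RS purely for illustration.
    static Op multiInsertPlan(boolean withLimit) {
        Op rs = new Op(Kind.RS);
        Op top = withLimit ? rs.add(new Op(Kind.LIM)) : rs;
        Op sel3 = top.add(new Op(Kind.SEL));
        sel3.add(new Op(Kind.SEL)).add(new Op(Kind.FS)); // SEL[4]-FS[5]
        sel3.add(new Op(Kind.SEL)).add(new Op(Kind.FS)); // SEL[6]-FS[7]
        return rs;
    }

    public static void main(String[] args) {
        Op plain = multiInsertPlan(false);
        Op limited = multiInsertPlan(true);
        System.out.println("single path, multi insert: " + singlePathRule(plain));
        System.out.println("all branches, multi insert: " + allBranchesRule(plain));
        System.out.println("all branches, multi insert + limit: " + allBranchesRule(limited));
    }
}
```

In this toy model the multi-insert plan is rejected by the single-path rule because of the fan-out, but accepted by the relaxed rule unless a limit is present. Those are the two cases the requested qtest is meant to cover.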

