hive-issues mailing list archives

From "liyunzhang_intel (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases
Date Mon, 08 May 2017 03:27:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000206#comment-16000206 ]

liyunzhang_intel edited comment on HIVE-16600 at 5/8/17 3:26 AM:
-----------------------------------------------------------------

[~lirui], [~xuefuz]: I have updated HIVE-16600.1.patch. Please help review.
{noformat}
In the case I provided in the description, with HIVE-16600.1.patch the parallelism of
RS[2] (Sort) is 46; without HIVE-16600.1.patch, the parallelism of RS[2] (Sort) is 1.

Before HIVE-16600.1.patch, the parallelism of RS[2] is 1:
#grep SetSparkReducerParallelism logs/hive.log
2017-05-08T10:31:10,820  INFO [63ddd225-f012-4b14-9141-38597f94c85b main] spark.SetSparkReducerParallelism:
Number of reducers determined to be: 1

After HIVE-16600.1.patch, the parallelism of RS[2] is 46:
#grep SetSparkReducerParallelism logs/hive.log
2017-05-08T10:22:49,432 DEBUG [42c701ac-380e-43e3-a3ab-f5aa7c2b55ee main] spark.SetSparkReducerParallelism:
Sibling RS[2] has stats: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats:
NONE
2017-05-08T10:23:10,403  INFO [42c701ac-380e-43e3-a3ab-f5aa7c2b55ee main] spark.SetSparkReducerParallelism:
Set parallelism for reduce sink RS[2] to: 46 (calculated)
{noformat}
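The "calculated" value in the log above comes from Hive's stats-based reducer estimate (the Utilities.estimateReducers path referenced in the description). As a rough, hypothetical sketch of that calculation (the names, rounding, and cap handling here are assumptions for illustration, not the exact Hive code), the reducer count scales with the estimated data size divided by hive.exec.reducers.bytes.per.reducer, clamped between 1 and the configured maximum:

```java
// Hypothetical sketch of a stats-based reducer estimate, modeled loosely on
// Utilities.estimateReducers. This is an illustration of the general formula,
// NOT the actual Hive implementation (which considers additional factors).
public class ReducerEstimate {
    static int estimateReducers(long totalInputSize, long bytesPerReducer, int maxReducers) {
        // Scale reducer count with input size: one reducer per bytesPerReducer bytes.
        int reducers = (int) Math.ceil((double) totalInputSize / bytesPerReducer);
        // Never fewer than 1, never more than the configured maximum.
        reducers = Math.max(1, reducers);
        return Math.min(maxReducers, reducers);
    }
}
```

The point of the patch is that RS[2] goes through this estimate (yielding a value driven by its stats) instead of being pinned at a parallelism of 1.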




> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.1.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
>     SELECT key, value
> INSERT OVERWRITE TABLE e2
>     SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even when we enable parallel order by ("hive.optimize.sampling.orderby"
> is set to "true"). This is not reasonable, because the parallelism should be calculated by
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170].
> This happens because SetSparkReducerParallelism#needSetParallelism returns false when the
> [children size of RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
> is greater than 1.
> In this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>    TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                             -SEL[6]-FS[7]
> {code}
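The refactor described above can be sketched with a minimal operator tree. This is an assumed illustration of the idea, not Hive's actual needSetParallelism code: instead of refusing to set parallelism whenever the reduce sink's subtree fans out into more than one branch, walk each branch and allow the estimate when every branch terminates in a FileSink, which is exactly the multi-insert shape in the plan above.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for an operator in the logical plan (e.g. "RS", "SEL", "FS").
class Op {
    final String name;
    final List<Op> children = new ArrayList<>();
    Op(String name) { this.name = name; }
    Op add(Op child) { children.add(child); return this; }
}

// Sketch of the needSetParallelism refactor idea (assumed logic for
// illustration, NOT the actual Hive implementation).
public class NeedSetParallelism {
    // Old behavior: bail out as soon as the operator has multiple children.
    static boolean oldCheck(Op op) {
        return op.children.size() <= 1;
    }

    // Sketched new behavior: fan-out is acceptable as long as every
    // downstream branch ends in a FileSink, i.e. the multi-insert pattern.
    static boolean newCheck(Op op) {
        if (op.children.isEmpty()) {
            return op.name.equals("FS");
        }
        for (Op child : op.children) {
            if (!newCheck(child)) {
                return false;
            }
        }
        return true;
    }
}
```

For the plan in the description, where the subtree under RS[2] fans out into SEL-FS and SEL-FS branches, the old check would veto setting parallelism at the branch point while the sketched new check accepts it, letting the stats-based estimate apply.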



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
