drill-issues mailing list archives

From "Rahul Challapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5604) Possible performance degradation with hash aggregate when number of distinct keys increase
Date Fri, 23 Jun 2017 18:38:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061360#comment-16061360
] 

Rahul Challapalli commented on DRILL-5604:
------------------------------------------

The physical plan for both queries looks identical:
{code}
00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759078689999E9 rows, 4.6137407764079994E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163643
00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759078589998E9 rows, 4.6137407763979996E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163642
00-02        StreamAgg(group=[{}], EXPR$0=[$SUM0($0)]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759078589998E9 rows, 4.6137407763979996E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163641
00-03          UnionExchange : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759077589998E9 rows, 4.6137407751979996E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163640
01-01            StreamAgg(group=[{}], EXPR$0=[COUNT($0)]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759076589998E9 rows, 4.6137407743979996E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.575656766064E10 memory}, id = 163639
01-02              HashAgg(group=[{0}]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E7, cumulative cost = {9.791959196599998E9 rows, 4.57918091841E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.575656766064E10 memory}, id = 163638
01-03                Project(ss_list_price=[$0]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E8, cumulative cost = {9.503960396699999E9 rows, 4.34878187849E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.06877887824E10 memory}, id = 163637
01-04                  HashToRandomExchange(dist0=[[$0]]) : rowType = RecordType(DOUBLE ss_list_price, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 2.879987999E8, cumulative cost = {9.503960396699999E9 rows, 4.34878187849E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.06877887824E10 memory}, id = 163636
02-01                    UnorderedMuxExchange : rowType = RecordType(DOUBLE ss_list_price, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 2.879987999E8, cumulative cost = {9.2159615968E9 rows, 3.88798379865E10 cpu, 0.0 io, 0.0 network, 5.06877887824E10 memory}, id = 163635
03-01                      Project(ss_list_price=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)]) : rowType = RecordType(DOUBLE ss_list_price, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 2.879987999E8, cumulative cost = {8.9279627969E9 rows, 3.85918391866E10 cpu, 0.0 io, 0.0 network, 5.06877887824E10 memory}, id = 163634
03-02                        HashAgg(group=[{0}]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E8, cumulative cost = {8.639963997E9 rows, 3.7439843987E10 cpu, 0.0 io, 0.0 network, 5.06877887824E10 memory}, id = 163633
03-03                          Project(ss_list_price=[CAST($0):DOUBLE]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E9, cumulative cost = {5.759975998E9 rows, 1.4399939995E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 163632
03-04                            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/tpcds/parquet/sf1000/store_sales]], selectionRoot=maprfs:/drill/testdata/tpcds/parquet/sf1000/store_sales, numFiles=1, usedMetadataFile=false, columns=[`ss_list_price`]]]) : rowType = RecordType(ANY ss_list_price): rowcount = 2.879987999E9, cumulative cost = {2.879987999E9 rows, 2.879987999E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 163631
{code}
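
For reference, a plan like the one above can be printed without executing the query by using Drill's EXPLAIN statement. The workspace name `dfs.tpcds` below is only a placeholder for whatever storage plugin/workspace actually points at the sf1000 parquet data.

{code}
-- Placeholder workspace name; substitute the plugin/workspace that holds
-- /drill/testdata/tpcds/parquet/sf1000.
USE dfs.tpcds;

-- Print the physical plan (with costs and operator ids) without running the query.
EXPLAIN PLAN FOR
SELECT COUNT(DISTINCT ss_list_price) FROM store_sales;
{code}

EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR prints only the logical plan if the physical-level detail above is not needed.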

> Possible performance degradation with hash aggregate when number of distinct keys increase
> ------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5604
>                 URL: https://issues.apache.org/jira/browse/DRILL-5604
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Rahul Challapalli
>
> git.commit.id.abbrev=90f43bf
> I tried to track the runtime as we gradually increase the number of distinct keys without
> increasing the total number of records. Below is one such test on top of the tpcds sf1000 dataset:
> {code}
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_list_price) from store_sales;
> +---------+
> | EXPR$0  |
> +---------+
> | 19736   |
> +---------+
> 1 row selected (163.345 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_net_profit) from store_sales;
> +----------+
> |  EXPR$0  |
> +----------+
> | 1525675  |
> +----------+
> 1 row selected (2094.962 seconds)
> {code}
> In both of the above queries, the hash agg code processed 2879987999 records, so the time
> difference is due to overheads like hash table resizing, etc. The second query took ~30 minutes
> longer than the first, raising doubts about whether there is an issue somewhere.
> The dataset is too large to attach to the jira, and so are the logs.
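
One way to probe the hash-table-resizing theory described above could be to pre-size the hash table and to compare against a plan without the hash aggregate. The sketch below uses the standard `exec.min_hash_table_size` and `planner.enable_hashagg` session options; the chosen starting size (2097152 buckets, a power of two above the ~1.5M distinct ss_net_profit values) is only an illustrative assumption, not a measured recommendation.

{code}
-- Illustrative sketch only: start the hash table large enough for ~1.5M distinct
-- keys so it never has to resize (the value must stay below exec.max_hash_table_size).
ALTER SESSION SET `exec.min_hash_table_size` = 2097152;
SELECT COUNT(DISTINCT ss_net_profit) FROM store_sales;

-- Compare against a sort-based plan by disabling the hash aggregate entirely.
ALTER SESSION SET `planner.enable_hashagg` = false;
SELECT COUNT(DISTINCT ss_net_profit) FROM store_sales;

-- Restore the defaults afterwards.
ALTER SESSION RESET `exec.min_hash_table_size`;
ALTER SESSION RESET `planner.enable_hashagg`;
{code}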



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
