drill-issues mailing list archives

From "Chun Chang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5138) TopN operator on top of ~110 GB data set is very slow
Date Tue, 07 Nov 2017 19:27:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242709#comment-16242709 ]

Chun Chang commented on DRILL-5138:
-----------------------------------

I ran the query against MapR Drill 1.11.0 and it returned in 81 seconds.

[root@perfnode166 catalog_sales]# sqlline --maxWidth=10000 -u "jdbc:drill:zk=10.10.30.166:5181"
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
apache drill 1.11.0-mapr
"drill baby drill"
0: jdbc:drill:zk=10.10.30.166:5181> select * from dfs.`/drill/testdata/tpcds_sf100/parquet/catalog_sales` order by cs_quantity, cs_wholesale_cost limit 1;
+------------------+-------------------+----------------------+-------------------+--------------------+---------------------+----------------+----------------------+--------------------+---------------------+-------------------+-------------+------------------------+-------------+----------------+--------------+-----------------------+---------------------------+----------------------+----------------+------------------+--------------+--------------+-----------------+------------------+-------------------+----------------------+------------------+-------------------+------------------+------------------+------------------+------------------+--------------------+
| cs_bill_addr_sk  | cs_bill_cdemo_sk  | cs_bill_customer_sk  | cs_bill_hdemo_sk  | cs_call_center_sk  | cs_catalog_page_sk  | cs_coupon_amt  | cs_ext_discount_amt  | cs_ext_list_price  | cs_ext_sales_price  | cs_ext_ship_cost  | cs_ext_tax  | cs_ext_wholesale_cost  | cs_item_sk  | cs_list_price  | cs_net_paid  | cs_net_paid_inc_ship  | cs_net_paid_inc_ship_tax  | cs_net_paid_inc_tax  | cs_net_profit  | cs_order_number  | cs_promo_sk  | cs_quantity  | cs_sales_price  | cs_ship_addr_sk  | cs_ship_cdemo_sk  | cs_ship_customer_sk  | cs_ship_date_sk  | cs_ship_hdemo_sk  | cs_ship_mode_sk  | cs_sold_date_sk  | cs_sold_time_sk  | cs_warehouse_sk  | cs_wholesale_cost  |
| 184649           | 555979            | 1796891              | 1114              | 24                 | 14393               | 0.00           | 0.02                 | 1.82               | 1.80                | 0.25              | 0.00        | 1.00                   | 108618      | 1.82           | 1.80         | 2.05                  | 2.05                      | 1.80                 | 0.80           | 15928478         | 540          | 1            | 1.80            | 184649           | 555979            | 1796891              | 2452671          | 1114              | 9                | 2452640          | 38871            | 1                | 1.00               |

1 row selected (81.577 seconds)

> TopN operator on top of ~110 GB data set is very slow
> -----------------------------------------------------
>
>                 Key: DRILL-5138
>                 URL: https://issues.apache.org/jira/browse/DRILL-5138
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Rahul Challapalli
>            Assignee: Timothy Farkas
>
> git.commit.id.abbrev=cf2b7c7
> No of cores : 23
> No of disks : 5
> DRILL_MAX_DIRECT_MEMORY="24G"
> DRILL_MAX_HEAP="12G"
> The query below ran for more than 4 hours and did not complete. The table is ~110 GB.
> {code}
> select * from catalog_sales order by cs_quantity, cs_wholesale_cost limit 1;
> {code}
> Physical Plan :
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 1.0, cumulative cost = {1.00798629141E10 rows, 4.17594320691E10 cpu, 0.0 io, 4.1287118487552E13 network, 0.0 memory}, id = 352
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 1.0, cumulative cost = {1.0079862914E10 rows, 4.1759432069E10 cpu, 0.0 io, 4.1287118487552E13 network, 0.0 memory}, id = 351
> 00-02        Project(T0¦¦*=[$0]) : rowType = RecordType(ANY T0¦¦*): rowcount = 1.0, cumulative cost = {1.0079862914E10 rows, 4.1759432069E10 cpu, 0.0 io, 4.1287118487552E13 network, 0.0 memory}, id = 350
> 00-03          SelectionVectorRemover : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost): rowcount = 1.0, cumulative cost = {1.0079862914E10 rows, 4.1759432069E10 cpu, 0.0 io, 4.1287118487552E13 network, 0.0 memory}, id = 349
> 00-04            Limit(fetch=[1]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost): rowcount = 1.0, cumulative cost = {1.0079862913E10 rows, 4.1759432068E10 cpu, 0.0 io, 4.1287118487552E13 network, 0.0 memory}, id = 348
> 00-05              SingleMergeExchange(sort0=[1 ASC], sort1=[2 ASC]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost): rowcount = 1.439980416E9, cumulative cost = {1.0079862912E10 rows, 4.1759432064E10 cpu, 0.0 io, 4.1287118487552E13 network, 0.0 memory}, id = 347
> 01-01                SelectionVectorRemover : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost): rowcount = 1.439980416E9, cumulative cost = {8.639882496E9 rows, 3.0239588736E10 cpu, 0.0 io, 2.3592639135744E13 network, 0.0 memory}, id = 346
> 01-02                  TopN(limit=[1]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost): rowcount = 1.439980416E9, cumulative cost = {7.19990208E9 rows, 2.879960832E10 cpu, 0.0 io, 2.3592639135744E13 network, 0.0 memory}, id = 345
> 01-03                    Project(T0¦¦*=[$0], cs_quantity=[$1], cs_wholesale_cost=[$2]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost): rowcount = 1.439980416E9, cumulative cost = {5.759921664E9 rows, 2.879960832E10 cpu, 0.0 io, 2.3592639135744E13 network, 0.0 memory}, id = 344
> 01-04                      HashToRandomExchange(dist0=[[$1]], dist1=[[$2]]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 1.439980416E9, cumulative cost = {5.759921664E9 rows, 2.879960832E10 cpu, 0.0 io, 2.3592639135744E13 network, 0.0 memory}, id = 343
> 02-01                        UnorderedMuxExchange : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 1.439980416E9, cumulative cost = {4.319941248E9 rows, 1.1519843328E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 342
> 03-01                          Project(T0¦¦*=[$0], cs_quantity=[$1], cs_wholesale_cost=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($2, hash32AsDouble($1))]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 1.439980416E9, cumulative cost = {2.879960832E9 rows, 1.0079862912E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 341
> 03-02                            Project(T0¦¦*=[$0], cs_quantity=[$1], cs_wholesale_cost=[$2]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost): rowcount = 1.439980416E9, cumulative cost = {1.439980416E9 rows, 4.319941248E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 340
> 03-03                              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/tpcds/parquet/sf1000/catalog_sales]], selectionRoot=maprfs:/drill/testdata/tpcds/parquet/sf1000/catalog_sales, numFiles=1, usedMetadataFile=false, columns=[`*`]]]) : rowType = (DrillRecordRow[*, cs_quantity, cs_wholesale_cost]): rowcount = 1.439980416E9, cumulative cost = {1.439980416E9 rows, 4.319941248E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 345
> {code}
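
For readers unfamiliar with the operator: in the plan above, TopN(limit=[1]) runs in fragment 01, and its output is merged by SingleMergeExchange before the final Limit(fetch=[1]). The general idea behind a top-N step is to keep only the N best rows seen so far instead of sorting the whole input. The Java sketch below illustrates that idea with a bounded heap. It is not Drill's actual TopN implementation; the class TopNSketch, the Row type, and the topN method are hypothetical names made up for this example, and only the two sort keys from the query are modeled.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSketch {

    // Hypothetical row type: holds only the two ORDER BY keys from the query.
    public static final class Row {
        final int csQuantity;
        final BigDecimal csWholesaleCost;

        public Row(int csQuantity, BigDecimal csWholesaleCost) {
            this.csQuantity = csQuantity;
            this.csWholesaleCost = csWholesaleCost;
        }
    }

    // Same ordering as: ORDER BY cs_quantity, cs_wholesale_cost (both ascending).
    static final Comparator<Row> ORDER =
        Comparator.comparingInt((Row r) -> r.csQuantity)
                  .thenComparing((Row r) -> r.csWholesaleCost);

    // Returns the first n rows in ORDER while retaining at most n rows in memory.
    public static List<Row> topN(Iterable<Row> rows, int n) {
        if (n <= 0) {
            return Collections.emptyList();
        }
        // Max-heap on ORDER: the head is the worst row currently retained.
        PriorityQueue<Row> heap = new PriorityQueue<>(ORDER.reversed());
        for (Row r : rows) {
            if (heap.size() < n) {
                heap.add(r);
            } else if (ORDER.compare(r, heap.peek()) < 0) {
                // The new row beats the worst retained row, so swap it in.
                heap.poll();
                heap.add(r);
            }
        }
        List<Row> result = new ArrayList<>(heap);
        result.sort(ORDER);
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(3, new BigDecimal("2.50")),
            new Row(1, new BigDecimal("1.80")),
            new Row(1, new BigDecimal("1.00")),
            new Row(2, new BigDecimal("0.50")));
        Row first = topN(rows, 1).get(0);
        System.out.println(first.csQuantity + " " + first.csWholesaleCost);
    }
}
```

With N = 1 the heap never holds more than one row, so the work per input row in this sketch is a single comparison. This says nothing about where the time went in the run reported in the issue.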



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
