drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-5528) Sorting 19GB data with 14GB memory in a single fragment takes ~150 minutes
Date Thu, 08 Jun 2017 23:38:18 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043607#comment-16043607
] 

Paul Rogers edited comment on DRILL-5528 at 6/8/17 11:37 PM:
-------------------------------------------------------------

Preliminary results suggest that Drill is severely CPU bound for this query. Running on a
Mac, the thread running the query is at 100% of CPU. But, disk I/O during the read-sort-and-spill
phase is very small, only about 40MB/s for reads, 20 MB/s for writes peak, with many valleys
in between -- on an SSD that can do 700 MB/s. Earlier tests got upwards of 70+ MB/s.

Later, during the merge phase, read performance jumps to 150 MB/s. CPU is still around 95%.

Total run time was just over an hour for a file that formerly took just a few minutes on previous
test variations.

This suggests that, somehow, something is using far to much CPU to do the work of the sort.
Will have to investigate. Note, the above is on the DRILL-5325 branch, not master.

Time summary:

{code}
Results: 1 records, 2 batches, 3,608,140 ms

00 Screen
  Setup:        0 ms -   0%,   0%
  Process:      4 ms -   0%,   0%
01 Project
  Setup:        5 ms -   0%,   0%
  Process:      1 ms -   0%,   0%
02 StreamAgg (STREAMING AGGREGATE)
  Setup:      133 ms -  11%,   0%
  Process:      4 ms -   0%,   0%
03 Project
  Setup:       53 ms -   4%,   0%
  Process:      0 ms -   0%,   0%
04 SelectionVectorRemover (SELECTION VECTOR REMOVER)
  Setup:       51 ms -   4%,   0%
  Process:    501 ms -   0%,   0%
05 Filter
  Setup:      194 ms -  17%,   0%
  Process: 14,910 ms -   0%,   0%
06 Project
  Setup:       41 ms -   3%,   0%
  Process:  1,414 ms -   0%,   0%
07 SelectionVectorRemover (SELECTION VECTOR REMOVER)
  Setup:        6 ms -   0%,   0%
  Process: 33,020 ms -   1%,   1%
08 Sort (EXTERNAL SORT)
  Setup:        0 ms -   0%,   0%
  Process: 1,996,522 ms -  87%,  87%
09 Project
  Setup:      137 ms -  12%,   0%
  Process: 24,847 ms -   1%,   1%
10 Project
  Setup:      499 ms -  44%,   0%
  Process: 45,571 ms -   2%,   2%
11 Scan (TEXT SUB SCAN)
  Setup:        0 ms -   0%,   0%
  Process: 153,725 ms -   6%,   6%
Total:
  Setup:    1,119 ms
  Process: 2,270,519 ms
Input batches: 0, spills: 58, merge/spills: 0
{code}

In summary form:

{code}
08 Sort (EXTERNAL SORT)
  Process: 1,996,522 ms -  87%,  87%
10 Project
  Process: 45,571 ms -   2%,   2%
11 Scan (TEXT SUB SCAN)
  Process: 153,725 ms -   6%,   6%
{code}

The sort itself takes 87% of the time. Reading the original 18 GB of data takes just 6% of
the time. The read rate during the scan is about 18 GB / 154 s = 116 MB/s, which is pretty
good. So, clearly the delay is in the sort itself somewhere.



was (Author: paul-rogers):
Preliminary results suggest that Drill is severely CPU bound for this query. Running on a
Mac, the thread running the query is at 100% of CPU. But, disk I/O during the read-sort-and-spill
phase is very small, only about 40MB/s for reads, 20 MB/s for writes peak, with many valleys
in between -- on an SSD that can do 700 MB/s. Earlier tests got upwards of 70+ MB/s.

Later, during the merge phase, read performance jumps to 150 MB/s. CPU is still around 95%.

This suggests that, somehow, something is using far to much CPU to do the work of the sort.
Will have to investigate. Note, the above is on the DRILL-5325 branch, not master.

The original goal of the "managed sort" project was just to manage the memory used by the
sort. Fixing performance issues, while important, starts wandering a bit off of the original
goals...

> Sorting 19GB data with 14GB memory in a single fragment takes ~150 minutes
> --------------------------------------------------------------------------
>
>                 Key: DRILL-5528
>                 URL: https://issues.apache.org/jira/browse/DRILL-5528
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Paul Rogers
>         Attachments: 26e2183f-2c5d-4044-0e23-4aeb45c40be5.sys.drill, drillbit.log, drillbit.out,
drill-env.sh
>
>
> Configuration :
> {code}
> git.commit.id.abbrev=1e0a14c
> DRILL_MAX_DIRECT_MEMORY="32G"
> DRILL_MAX_HEAP="4G"
> {code}
> Based on the runtime of the below query, I suspect there is a performance bottleneck
somewhere
> {code}
> [root@qa-node190 external-sort]# /opt/drill/bin/sqlline -u jdbc:drill:zk=10.10.100.190:5181
> apache drill 1.11.0-SNAPSHOT
> "start your sql engine"
> 0: jdbc:drill:zk=10.10.100.190:5181> ALTER SESSION SET `exec.sort.disable_managed`
= false;
> +-------+-------------------------------------+
> |  ok   |               summary               |
> +-------+-------------------------------------+
> | true  | exec.sort.disable_managed updated.  |
> +-------+-------------------------------------+
> 1 row selected (0.975 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> alter session set `planner.width.max_per_node`
= 1;
> +-------+--------------------------------------+
> |  ok   |               summary                |
> +-------+--------------------------------------+
> | true  | planner.width.max_per_node updated.  |
> +-------+--------------------------------------+
> 1 row selected (0.371 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> alter session set `planner.disable_exchanges`
= true;
> +-------+-------------------------------------+
> |  ok   |               summary               |
> +-------+-------------------------------------+
> | true  | planner.disable_exchanges updated.  |
> +-------+-------------------------------------+
> 1 row selected (0.292 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> alter session set `planner.memory.max_query_memory_per_node`
= 14106127360;
> +-------+----------------------------------------------------+
> |  ok   |                      summary                       |
> +-------+----------------------------------------------------+
> | true  | planner.memory.max_query_memory_per_node updated.  |
> +-------+----------------------------------------------------+
> 1 row selected (0.316 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl`
order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf';
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> 1 row selected (8530.719 seconds)
> {code}
> I attached the logs and profile files. The data is too large to attach to a jira. Reach
out to me if you need any more information



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message