drill-issues mailing list archives

From "Timothy Farkas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5755) TOP_N_SORT operator does not free memory while running
Date Fri, 08 Sep 2017 20:42:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159298#comment-16159298 ]

Timothy Farkas commented on DRILL-5755:
---------------------------------------

The root cause of the issue is that the operator keeps a hyper batch, which is the combination
of a number of upstream batches. This hyper batch is purged every N windows, as dictated by
drill.exec.sort.purge.threshold.
There are two issues with this:

* *drill.exec.sort.purge.threshold* is currently ill-defined because no default value is defined
for it.
* I don't agree with the design as it's laid out in https://issues.apache.org/jira/browse/DRILL-385.
I don't see why we couldn't make the priority queue hold the records themselves rather than
just the indices. It is more work, but if we did that we could eliminate the need for keeping
a hyper batch that needs to be periodically purged.
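
To illustrate the second point, here is a minimal sketch of a bounded priority queue that holds the records themselves. It is not Drill's implementation: the class name BoundedTopN and its methods are hypothetical, and plain Java objects stand in for Drill's batches and value vectors.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not Drill's TOP_N_SORT code.
// Keeps at most `limit` records, so there is no side buffer to purge.
class BoundedTopN<T> {
    private final int limit;
    private final Comparator<T> order;
    // Max-heap under `order`: the head is the worst record currently retained.
    private final PriorityQueue<T> heap;

    BoundedTopN(int limit, Comparator<T> order) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        this.limit = limit;
        this.order = order;
        this.heap = new PriorityQueue<>(limit, order.reversed());
    }

    // Offer one incoming record; it is kept only if it belongs in the top N.
    void add(T record) {
        if (heap.size() < limit) {
            heap.add(record);
        } else if (order.compare(record, heap.peek()) < 0) {
            heap.poll();      // drop the current worst record immediately
            heap.add(record);
        }
        // Otherwise the record is discarded and never retained.
    }

    // Number of records currently held; never exceeds `limit`.
    int retained() {
        return heap.size();
    }

    // The retained records in final sort order.
    List<T> sorted() {
        List<T> out = new ArrayList<>(heap);
        out.sort(order);
        return out;
    }
}
```

With this shape, the number of retained records never exceeds N, so nothing has to be purged periodically. The extra work mentioned above is that each admitted record would have to be copied out of its incoming batch, which this sketch glosses over.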

> TOP_N_SORT operator does not free memory while running
> ------------------------------------------------------
>
>                 Key: DRILL-5755
>                 URL: https://issues.apache.org/jira/browse/DRILL-5755
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Timothy Farkas
>         Attachments: 2658c253-20b6-db90-362a-139aae4a327e.sys.drill
>
>
>  The TOP_N_SORT operator should keep the top N rows while processing its input, and free the memory used to hold all rows below the top N.
> For example, the following query uses a table with 125M rows:
> {code}
> select row_count, sum(row_count), avg(double_field), max(double_rand), count(float_rand) from dfs.`/data/tmp` group by row_count order by row_count limit 30;
> {code}
> And failed with an OOM when each of the 3 TOP_N_SORT operators was holding about 2.44 GB !! (see attached profile).  It should take far less memory to hold 30 rows !!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
