drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (DRILL-5011) External Sort Batch memory use depends on record width
Date Wed, 09 Nov 2016 17:01:58 GMT

     [ https://issues.apache.org/jira/browse/DRILL-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers reassigned DRILL-5011:

    Assignee: Paul Rogers

> External Sort Batch memory use depends on record width
> ------------------------------------------------------
>                 Key: DRILL-5011
>                 URL: https://issues.apache.org/jira/browse/DRILL-5011
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
> The ExternalSortBatch operator uses spill-to-disk to keep memory needs within a defined
limit. However, the "copier" (really, the merge operation) can use an amount of memory determined
not by the operator configuration, but by the width of each record.
> The copier memory limit appears to be set by the COPIER_BATCH_MEM_LIMIT value.
> However, the actual memory use is determined by the number of records that the copier
is asked to copy. That record count comes from an estimate of row width based on the type of each
column. Note that the row width estimate *is not* based on the actual data in each row. Varchar fields,
for example, are assumed to be 40 characters wide. If the sorter is asked to sort records
with Varchar fields of, say, 1000 characters, then the estimate will be a poor predictor
of actual width.
> The target record count is computed as:
> {code}
> target record count = memory limit / estimated row width
> {code}
> Actual memory use is:
> {code}
> memory use = target record count * actual row width
> {code}
> Substituting the first formula into the second gives:
> {code}
> memory use = memory limit * actual row width / estimated row width
> {code}
> {code}
> That is, memory use depends on the ratio of actual to estimated width. If the estimate
is too low by a factor of 2, then we use twice as much memory as expected.
> Note that the memory used for the copier defaults to 20 MB, so even an error of 4x still
means only 80 MB of memory used; small in comparison to the many GB typically allocated to
ESB storage.
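
The arithmetic in the description can be checked with a short sketch. This is not Drill code: the function names are hypothetical, the 20 MB limit and the 40-character Varchar estimate are taken from the description above, and a single-column row with one byte per character is assumed for simplicity.

```python
# Illustrative sketch only; names are hypothetical and not Drill's API.
# Assumes one Varchar column per row and one byte per character.

COPIER_BATCH_MEM_LIMIT = 20 * 1024 * 1024  # 20 MB default, per the description
ESTIMATED_VARCHAR_WIDTH = 40               # assumed Varchar width, per the description


def target_record_count(memory_limit: int, estimated_row_width: int) -> int:
    """Number of records the copier is asked to copy in one batch."""
    return memory_limit // estimated_row_width


def copier_memory_use(memory_limit: int, estimated_row_width: int,
                      actual_row_width: int) -> int:
    """Memory actually consumed when rows are wider than estimated."""
    return target_record_count(memory_limit, estimated_row_width) * actual_row_width


if __name__ == "__main__":
    for actual_width in (40, 160, 1000):
        used = copier_memory_use(COPIER_BATCH_MEM_LIMIT,
                                 ESTIMATED_VARCHAR_WIDTH, actual_width)
        print(f"actual width {actual_width}: {used / (1024 * 1024):.0f} MB")
```

Under these assumptions, a 4x error (160 bytes actual against 40 estimated) gives the 80 MB figure quoted above, while the 1000-character Varchar case is a 25x error.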

This message was sent by Atlassian JIRA
