drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-5014) ExternalSortBatch cache size, spill count differs from config setting
Date Tue, 08 Nov 2016 18:47:58 GMT

     [ https://issues.apache.org/jira/browse/DRILL-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Paul Rogers updated DRILL-5014:
-------------------------------
    Summary: ExternalSortBatch cache size, spill count differs from config setting  (was:
ExternalSortBatch spills only half the batches requested in config)

> ExternalSortBatch cache size, spill count differs from config setting
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5014
>                 URL: https://issues.apache.org/jira/browse/DRILL-5014
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> The ExternalSortBatch (ESB) operator sorts data while spilling to disk to remain within
a defined memory footprint. Spilling occurs based on a number of factors. Among those are
two config parameters:
> * {{drill.exec.sort.external.spill.group.size}}: The number of batches to spill per spill
event.
> * {{drill.exec.sort.external.spill.threshold}}: The number of batches to accumulate in
memory before starting a spill event.
> The expected behavior would be:
> * When the accumulated number of batches exceeds the threshold, spill the requested number
of batches.
> That is if the threshold is 200, and the size is 150, we should accumulate 200 batches,
then spill 150 of them (leaving 50) and repeat.
> The actual behavior is:
> * When the accumulated records exceeds the threshold and more than "batch size" new batches
have arrived since the last spill,
> * Spill half the accumulated records.
> So, the actual behavior for the (threshold=200, size=150) case is:
> {code}
> Before spilling, buffered batch count: 201
> After spilling, buffered batch count: 101
> Before spilling, buffered batch count: 251
> After spilling, buffered batch count: 126
> Before spilling, buffered batch count: 276
> After spilling, buffered batch count: 138
> Before spilling, buffered batch count: 288
> After spilling, buffered batch count: 144
> Before spilling, buffered batch count: 294
> After spilling, buffered batch count: 147
> Before spilling, buffered batch count: 297
> After spilling, buffered batch count: 149
> Before spilling, buffered batch count: 299
> After spilling, buffered batch count: 150
> Before spilling, buffered batch count: 300
> After spilling, buffered batch count: 150
> Before spilling, buffered batch count: 300
> {code}
> In short, the actual number of batches retained in memory is twice the spill size, **NOT**
the number set in the threshold. As a result, the ESB operator will use more memory than expected.
> The work-around is to set a batch size that is half the threshold so that the batch size
(used in spill decisions) matches the actual spill count (as implemented by the code.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message