drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5209) Standardize Drill's batch size
Date Sun, 22 Jan 2017 02:39:27 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833242#comment-15833242 ]

Paul Rogers commented on DRILL-5209:

See DRILL-5211. It turns out that Drill uses a memory allocation scheme that caches blocks
of 16 MB. If any single vector allocation is larger than this amount, Drill must allocate
memory directly from the JVM. The result is that Drill can hit OOM due to memory fragmentation:
plenty of memory exists as 16 MB blocks, but none at larger sizes.

As a result, every batch must be aware not just of row width, but also of _column_ width.
No batch may contain more rows than would fill any single column vector beyond 16 MB. This
logic does not exist anywhere in Drill today. As noted above, we instead allocate based on
aggregate batch totals or row counts, leaving us susceptible to memory fragmentation with
no good way to avoid the problem.
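The per-column limit described above can be sketched as a simple calculation: cap the row
count so that no column vector outgrows the allocator's 16 MB block. This is a minimal
illustration, not Drill's actual API; the class and method names are hypothetical.

```java
// Sketch: cap batch row count so no single value vector exceeds the
// 16 MB allocator block size. Names are illustrative, not Drill's API.
public class BatchSizeLimiter {
    // Largest allocation the allocator can serve from its cached 16 MB blocks.
    static final int MAX_VECTOR_BYTES = 16 * 1024 * 1024;

    /**
     * Given the estimated width in bytes of each column, return the
     * maximum row count such that no column vector grows past 16 MB.
     */
    static int maxRows(int[] columnWidthsBytes) {
        int limit = 64 * 1024; // Drill's traditional 64K-row ceiling
        for (int width : columnWidthsBytes) {
            if (width > 0) {
                limit = Math.min(limit, MAX_VECTOR_BYTES / width);
            }
        }
        return Math.max(limit, 1); // always allow at least one row
    }

    public static void main(String[] args) {
        // A 4-byte INT column never constrains the 64K default...
        System.out.println(maxRows(new int[] {4}));            // 65536
        // ...but a 50 KB VARCHAR limits the batch to 327 rows.
        System.out.println(maxRows(new int[] {4, 50 * 1024})); // 327
    }
}
```

Note that the binding constraint is the widest column, not the aggregate row width:
one 50 KB column forces a 327-row batch regardless of how narrow the other columns are.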

> Standardize Drill's batch size
> ------------------------------
>                 Key: DRILL-5209
>                 URL: https://issues.apache.org/jira/browse/DRILL-5209
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Priority: Minor
> Drill is columnar, implemented as a set of value vectors. Value vectors consume memory,
which is a fixed resource on each Drillbit. Effective resource management requires the ability
to control (or at least predict) resource usage.
> Most data consists of more than one column. A collection of columns (or rows, depending
on your perspective) is a record batch.
> Many parts of Drill use 64K rows as the target size of a record batch. The Flatten operator
targets batch sizes of 512 MB. The text scan operator appears to target batch sizes of 128
MB. Other operators may use other sizes.
> Operators that target 64K rows use, essentially, an unknown and potentially unlimited amount
of memory. While 64K rows of a 4-byte integer are fine, 64K rows of VARCHAR columns of 50 KB
each produce a batch of roughly 3.2 GB, which is far too large.
> This ticket requests three improvements.
> 1. Define a preferred batch size which is a balance between various needs: memory use,
network efficiency, benefits of vector operations, etc.
> 2. Provide a reliable way to learn the size of each row as it is added to a batch.
> 3. Use the above to limit batches to the preferred batch size.
> The above will go a long way to easing the task of managing memory because the planner
will have some hope of understanding how much memory to allocate to various operations.
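Improvement #2 above, learning the size of each row as it is added, could take the form of a
running accumulator that tells the writer when the preferred batch budget is reached. The
sketch below is a hypothetical illustration of that idea, not an existing Drill class.

```java
// Sketch of improvement #2: track the size of each row as it is written,
// so the writer can close a batch once a preferred byte budget is reached.
// All names are illustrative; Drill's real implementation may differ.
public class RowSizeTracker {
    private final long preferredBatchBytes;
    private long batchBytes;
    private int rowCount;

    RowSizeTracker(long preferredBatchBytes) {
        this.preferredBatchBytes = preferredBatchBytes;
    }

    /** Record one completed row; returns true if the batch should be flushed. */
    boolean addRow(long rowBytes) {
        batchBytes += rowBytes;
        rowCount++;
        return batchBytes >= preferredBatchBytes;
    }

    int rowCount()   { return rowCount; }
    long batchBytes() { return batchBytes; }

    public static void main(String[] args) {
        RowSizeTracker tracker = new RowSizeTracker(8L * 1024 * 1024); // 8 MB budget
        boolean full = false;
        while (!full) {
            full = tracker.addRow(50 * 1024); // each row carries a 50 KB VARCHAR
        }
        // 8 MB / 50 KB = 163.84, so the 164th row crosses the budget.
        System.out.println(tracker.rowCount()); // prints 164
    }
}
```

Combined with the per-column 16 MB limit discussed in the comment above, the flush condition
would be whichever bound is hit first: the byte budget or the widest column's row ceiling.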
