drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-5416) Vectors read from disk report incorrect memory sizes
Date Wed, 05 Apr 2017 23:17:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957987#comment-15957987
] 

Paul Rogers edited comment on DRILL-5416 at 4/5/17 11:17 PM:
-------------------------------------------------------------

The original design for serialization is that each vector serializes to a buffer. This is
simple for single-buffer vectors (a required Int, say). For composite vectors (nullable Int,
Varchar), the serialization process combines all buffers into a single write buffer, and the
corresponding read buffer is sliced into the individual component buffers.

For a Varchar:
{code}
Data:           [FredBarneyWilma_]
Offsets:        [00041015]
Output buffer:  [00041015FredBarneyWilma_]
Input buffer:   [00041015FredBarneyWilma_]
New Offsets:    [^^^^^^^^]
New Data:               [^^^^^^^^^^^^^^^]
{code}

Notice that, in the original, the empty space (denoted with "_") is allocated within each
vector's own buffer. After serialization, the free space sits in a buffer shared by the two
component vectors and is not "owned" by (or visible to) either.
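The combine-then-slice round trip can be sketched with plain NIO buffers. This is a hypothetical simplification: the class and method names are invented for illustration, and Drill's real code works on DrillBuf/Netty buffers, not {{java.nio.ByteBuffer}}.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only (not Drill's actual code): serialize a
// Varchar-like vector's offsets and data into one combined buffer,
// then slice the read buffer back into two views sharing that memory.
public class CompositeSliceSketch {

    // The read path: carve one combined buffer into an offsets view and
    // a data view that share the same backing memory.
    static ByteBuffer[] sliceComposite(ByteBuffer combined, int offsetsLen) {
        ByteBuffer whole = combined.duplicate();
        whole.position(0);
        whole.limit(offsetsLen);
        ByteBuffer newOffsets = whole.slice();       // the offsets vector's view
        whole.limit(combined.capacity());
        whole.position(offsetsLen);
        ByteBuffer newData = whole.slice();          // the data vector's view
        return new ByteBuffer[] { newOffsets, newData };
    }

    public static void main(String[] args) {
        byte[] offsets = {0, 4, 10, 15};             // ends of Fred, Barney, Wilma
        byte[] data = "FredBarneyWilma_".getBytes(); // '_' is unused space

        // The write path: offsets first, then data, in one output buffer.
        ByteBuffer out = ByteBuffer.allocate(offsets.length + data.length);
        out.put(offsets);
        out.put(data);

        ByteBuffer[] views = sliceComposite(out, offsets.length);
        // Each view reports only its own extent; the free byte ('_') in
        // the shared backing buffer is charged to neither vector.
        System.out.println(views[0].remaining());    // 4
        System.out.println(views[1].remaining());    // 16
    }
}
```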



> Vectors read from disk report incorrect memory sizes
> ----------------------------------------------------
>
>                 Key: DRILL-5416
>                 URL: https://issues.apache.org/jira/browse/DRILL-5416
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>             Fix For: 1.11.0
>
>
> The external sort and the revised hash agg operators spill to disk using a vector
> serialization mechanism. This mechanism serializes each vector as a (length, bytes) pair.
> Before spilling, if we check the memory used for a vector (using the new {{RecordBatchSizer}}
> class), we learn the actual memory consumed by the vector, including any unused space in
> the vector.
> If we spill the vector, then reread it, the reported storage size is wrong.
> On reading, the code allocates a buffer based on the saved length, rounded up to the next
> power of two. Then, when building the vector, we "slice" the read buffer, setting the
> memory size to the data size.
> For example, suppose we save 20 one-byte fields. The size on disk is 20 bytes. The read
> buffer is rounded up to 32 bytes (the size of the original, pre-spill buffer). We read the
> 20 bytes and create a vector. Creating the vector reports the memory size as 20, "hiding"
> the extra, unused 12 bytes.
> As a result, when computing memory sizes, we receive incorrect numbers. Working with
> false numbers means that the code cannot safely operate within a memory budget, causing the
> user to receive an unexpected OOM error.
> As it turns out, the code path that does the slicing is used only for reads from disk.
> This ticket asks to remove the slicing step: just use the allocated buffer directly so that
> the after-read vector reports the correct memory usage, the same as the before-spill vector.
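The 20-byte example above reduces to a small arithmetic sketch. This is illustrative only: {{nextPowerOfTwo}} here is a stand-in for the allocator's rounding behavior, not an actual Drill API.

```java
// Hypothetical sketch of the accounting problem (not Drill's
// BufferAllocator): reading 20 bytes allocates a 32-byte buffer, but
// slicing makes the vector report only the 20 data bytes, hiding the
// 12 bytes of slack from any memory-budget calculation.
public class SpillSizeSketch {
    // Round up to the next power of two, as a typical allocator does.
    static int nextPowerOfTwo(int n) {
        return Integer.highestOneBit(n - 1) << 1;  // assumes n >= 2
    }

    public static void main(String[] args) {
        int dataSize = 20;                          // bytes saved on disk
        int allocated = nextPowerOfTwo(dataSize);   // 32 bytes actually held
        int reportedAfterSlice = dataSize;          // the slice hides the slack

        System.out.println(allocated);                       // 32
        System.out.println(allocated - reportedAfterSlice);  // 12 hidden bytes
    }
}
```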



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
