drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5272) Text file reader is inefficient
Date Fri, 17 Feb 2017 05:06:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871177#comment-15871177 ]

Paul Rogers commented on DRILL-5272:

Suppose we have a collection of small files. Each call to the {{next()}} method accumulates
data from one file, but the way we handle the hand-off from one file to the next is
inefficient. On the first call:

* Allocate vectors for reader 1
* Call reader 1 to read the rows
* Hit EOF on reader 1
* Return batch for reader 1

On the second call:

* Allocate vectors for reader 1
* Call reader 1 to read the rows
* Hit EOF on the first record
* Close reader 1
* Open reader 2
* Allocate vectors for reader 2
* Call reader 2 to read the rows
* Hit EOF on reader 2
* Return batch for reader 2

Note the extra steps on the second call. Because the first call already reached EOF on
reader 1, we need not allocate vectors for that reader again. When we do, we simply throw
away the just-allocated reader 1 vectors and then allocate vectors for reader 2.
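
The hand-off above can be modeled with a small simulation. This is only a sketch under stated assumptions, not Drill's ScanBatch code: the names {{ToyReader}}, {{ToyScan}}, and {{count_allocations}} are hypothetical, each "file" is a list of rows that fits in one batch, and "allocating vectors" is reduced to incrementing a counter. The {{skip_exhausted}} flag switches between the behavior described above and a variant that remembers the reader already hit EOF.

```python
# Toy model of the reader hand-off; all names here are hypothetical.

class ToyReader:
    """Stands in for a record reader over one small file."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.pos = 0

    def read(self):
        """Read every remaining row; an empty list means EOF on the first record."""
        out = self.rows[self.pos:]
        self.pos = len(self.rows)
        return out

    def at_eof(self):
        return self.pos >= len(self.rows)


class ToyScan:
    """Stands in for the scan operator that walks a list of readers."""

    def __init__(self, files, skip_exhausted):
        self.readers = [ToyReader(rows) for rows in files]
        self.index = 0
        self.skip_exhausted = skip_exhausted
        self.allocations = 0

    def next(self):
        """Return the next non-empty batch, or None once all readers are done."""
        while self.index < len(self.readers):
            reader = self.readers[self.index]
            if self.skip_exhausted and reader.at_eof():
                # EOF was already seen on the previous call: move on
                # to the next reader without allocating anything.
                self.index += 1
                continue
            self.allocations += 1  # "allocate vectors" for the current reader
            batch = reader.read()
            if batch:
                return batch
            # EOF on the first record: the vectors just allocated are wasted.
            self.index += 1
        return None


def count_allocations(files, skip_exhausted):
    """Drain the scan and report (allocation count, list of batches)."""
    scan = ToyScan(files, skip_exhausted)
    batches = []
    batch = scan.next()
    while batch is not None:
        batches.append(batch)
        batch = scan.next()
    return scan.allocations, batches


files = [["a", "b"], ["c"], ["d", "e", "f"]]
naive_allocs, naive_batches = count_allocations(files, skip_exhausted=False)
lean_allocs, lean_batches = count_allocations(files, skip_exhausted=True)
```

In this toy model, three one-batch files cost six allocations with the current hand-off and three when the exhausted reader is skipped; the batches returned are the same either way.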

> Text file reader is inefficient
> -------------------------------
>                 Key: DRILL-5272
>                 URL: https://issues.apache.org/jira/browse/DRILL-5272
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>            Priority: Minor
> From inspection of the ScanBatch and CompliantTextReader.
> Every batch holds about five implicit vectors. These are repeated for every row, which
> can greatly increase incoming data size.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16 bytes,
> causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] - 
> Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
> Whether due to the above or to a different issue, something is causing memory growth in the scan:
> {code}
> Entry Memory: 6,456,448
> Exit Memory: 7,636,312
> Entry Memory: 7570560
> Exit Memory: 8750424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query. Perhaps provide
> them only if actually requested.
> The vectors are populated for every row, making a copy of a potentially long file name
> and path for every record. Since the values are common to every record, perhaps we can use
> the same data copy for each, but have the offset vector for each record just point to the
> single copy.
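
The last suggestion in the description, one data copy shared by every record, can be sketched as follows. This is an illustration only: it assumes a hypothetical layout in which each row stores a (start, length) pair into a byte buffer, which is an assumption of the sketch and not a description of Drill's offset vector. The function names and the sample path are made up for the example.

```python
# Hypothetical layout: a byte buffer plus one (start, length) entry per row.

def per_row_copy(value: bytes, row_count: int):
    """Copy the value into the buffer once per row, as the description reports."""
    data = bytearray()
    index = []
    for _ in range(row_count):
        index.append((len(data), len(value)))
        data += value
    return bytes(data), index


def shared_copy(value: bytes, row_count: int):
    """Store the value once and point every row's entry at that single copy."""
    data = bytes(value)
    index = [(0, len(value))] * row_count
    return data, index


def read_row(data: bytes, index, row: int) -> bytes:
    """Fetch one row's value through its (start, length) entry."""
    start, length = index[row]
    return data[start:start + length]


# Made-up implicit-column value and row count, for illustration only.
path = b"/data/logs/2017/02/part-0001.csv"
rows = 1000

copied_data, copied_index = per_row_copy(path, rows)
shared_data, shared_index = shared_copy(path, rows)
```

Both layouts return the same value for every row, but the per-row version stores {{rows * len(path)}} bytes of data while the shared version stores {{len(path)}} bytes, so the saving grows with both the path length and the batch size.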
