drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5272) Text file reader is inefficient
Date Fri, 17 Feb 2017 04:53:41 GMT
Paul Rogers created DRILL-5272:
----------------------------------

             Summary: Text file reader is inefficient
                 Key: DRILL-5272
                 URL: https://issues.apache.org/jira/browse/DRILL-5272
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10
            Reporter: Paul Rogers
            Priority: Minor


From inspection of the ScanBatch and CompliantTextReader.

Every batch holds about five implicit vectors. Their values are repeated for every row, which
can greatly increase the size of the incoming data.

When populating the vectors, the allocation starts at 8 bytes and grows to 16 bytes, causing
a (slow) memory reallocation for every vector:

{code}
[org.apache.drill.exec.vector.UInt4Vector] - 
Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
{code}
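
To put a rough number on the cost, here is a minimal sketch (not Drill code) that counts how many doublings a buffer needs when it starts at 8 bytes. It assumes each reallocation doubles the capacity and copies the old buffer, and that the offsets vector holds one 4-byte entry per row plus one. The class name, method names, and the 4096-row batch size are hypothetical, chosen only for illustration.

```java
final class ReallocSketch {

    /** Number of doublings needed to grow from initialBytes to at least requiredBytes. */
    static int doublingsNeeded(long initialBytes, long requiredBytes) {
        if (initialBytes <= 0) {
            throw new IllegalArgumentException("initialBytes must be positive");
        }
        int count = 0;
        for (long capacity = initialBytes; capacity < requiredBytes; capacity *= 2) {
            count++;
        }
        return count;
    }

    /** Total bytes copied, assuming every reallocation copies the whole old buffer. */
    static long bytesCopied(long initialBytes, long requiredBytes) {
        if (initialBytes <= 0) {
            throw new IllegalArgumentException("initialBytes must be positive");
        }
        long copied = 0;
        for (long capacity = initialBytes; capacity < requiredBytes; capacity *= 2) {
            copied += capacity;
        }
        return copied;
    }

    public static void main(String[] args) {
        int rows = 4096;                  // hypothetical batch size
        long required = 4L * (rows + 1);  // assumed layout: 4-byte offset per row, plus one
        System.out.println("doublings from 8 bytes: " + doublingsNeeded(8, required));
        System.out.println("bytes copied:           " + bytesCopied(8, required));
        System.out.println("doublings if pre-sized: " + doublingsNeeded(required, required));
    }
}
```

Under these assumptions, sizing the vector once from the expected row count would avoid the whole chain of reallocations.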

Whether due to the above or to a different issue, something is causing memory growth in the scan batch:

{code}
Entry Memory: 6,456,448
Exit Memory: 7,636,312
Entry Memory: 7570560
Exit Memory: 8750424
...
{code}

Evidently the implicit vectors are added in response to a "SELECT *" query. Perhaps provide
them only if actually requested.
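
A minimal sketch of the "only if requested" idea, independent of Drill's actual planner or scan code: given the projected column names, materialize an implicit column only when it is named explicitly, so a bare "*" selects none. The class and method names are hypothetical, and the list of implicit column names is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ImplicitColumnSketch {

    // Assumed set of implicit file columns; illustrative only.
    static final List<String> IMPLICIT_COLUMNS =
            Arrays.asList("fqn", "filepath", "filename", "suffix");

    /** Returns the implicit columns that the projection names explicitly. */
    static List<String> columnsToMaterialize(Collection<String> projected) {
        Set<String> requested = new HashSet<>();
        for (String column : projected) {
            requested.add(column.toLowerCase(Locale.ROOT));
        }
        List<String> result = new ArrayList<>();
        for (String name : IMPLICIT_COLUMNS) {
            if (requested.contains(name)) {
                result.add(name);   // a wildcard alone never matches an implicit name
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(columnsToMaterialize(Arrays.asList("*")));
        System.out.println(columnsToMaterialize(Arrays.asList("*", "filename")));
    }
}
```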

The vectors are populated for every row, making a copy of a potentially long file name and
path for every record. Since the values are common to every record, perhaps we can use the
same data copy for each, but have the offset vector for each record just point to the single
copy.
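
As a rough illustration of the saving, here is a minimal sketch of a column that stores one copy of the value and returns it for every row. It is not Drill's vector API. If the offset vector encodes consecutive start/end positions, sharing one copy would likely need a different representation, such as the constant column assumed here. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class ConstantColumnSketch {
    private final byte[] data;   // the single shared copy of the value
    private final int rowCount;

    ConstantColumnSketch(String value, int rowCount) {
        if (rowCount < 0) {
            throw new IllegalArgumentException("rowCount must be non-negative");
        }
        this.data = value.getBytes(StandardCharsets.UTF_8);
        this.rowCount = rowCount;
    }

    /** Every valid row reads the same underlying bytes. */
    String get(int row) {
        if (row < 0 || row >= rowCount) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        return new String(data, StandardCharsets.UTF_8);
    }

    /** Data bytes held when the value is stored once. */
    long sharedDataBytes() {
        return data.length;
    }

    /** Data bytes that would be held if the value were copied into every row. */
    long perRowCopyDataBytes() {
        return (long) data.length * rowCount;
    }
}
```

For a file path of a few hundred bytes and batches of thousands of rows, the per-row copy is larger than the shared copy by a factor equal to the row count.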



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
