drill-issues mailing list archives

From "Khurram Faraaz (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-5661) CSV reader created, holds onto two buffers per file with headers
Date Thu, 06 Jul 2017 05:49:00 GMT

     [ https://issues.apache.org/jira/browse/DRILL-5661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Khurram Faraaz updated DRILL-5661:
    Component/s: Storage - Text & CSV

> CSV reader created, holds onto two buffers per file with headers
> ----------------------------------------------------------------
>                 Key: DRILL-5661
>                 URL: https://issues.apache.org/jira/browse/DRILL-5661
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>             Fix For: Future
> DRILL-5273 fixed a problem in the "compliant" (CSV) record reader that would cause Drill
> to exhaust memory. Each reader would allocate two direct memory blocks, but not free them
> until the end of the fragment. Scanning 1000 files would therefore produce 1000 pairs of
> allocations, with only a single pair active at a time.
> As it turns out, DRILL-5273 missed a second pair created when reading column headers:
> {code}
>   private String[] extractHeader() throws SchemaChangeException, IOException, ExecutionSetupException {
> ...
>     // A second pair of managed buffers is requested here, once per file, just to read the header line.
>     TextInput hInput = new TextInput(settings, hStream, oContext.getManagedBuffer(READ_BUFFER),
>         0, split.getLength());
>     this.reader = new TextReader(settings, hInput, hOutput, oContext.getManagedBuffer(WHITE_SPACE_BUFFER));
> {code}
> If a query uses CSV column headers, it is subject to the same memory exhaustion
> seen earlier for `columns`-style queries. (Before DRILL-5273, queries with column headers
> were doubly exposed, since each file held both pairs of buffers.)
> The solution is simply to reuse the existing buffers: they are used first
> for the header line and then for the data lines, so a second set of buffers is unnecessary.
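
The reuse described above can be sketched in plain Java. This is an illustration under stated assumptions, not Drill's code: `BufferReuseSketch`, `CountingAllocator`, `heldAfterScan`, and `readLine` are hypothetical names, `ByteBuffer.allocateDirect` stands in for Drill's managed buffers, and the buffer sizes are arbitrary. The allocator keeps every buffer it hands out, mimicking buffers that are not freed until the end of the fragment.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class BufferReuseSketch {
    // Illustrative sizes only; not Drill's actual constants.
    static final int READ_BUFFER = 1024;
    static final int WHITE_SPACE_BUFFER = 64;

    /** Hypothetical allocator: holds every buffer until the "fragment" ends. */
    static final class CountingAllocator {
        private final List<ByteBuffer> held = new ArrayList<>();

        ByteBuffer getManagedBuffer(int size) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(size);
            held.add(buffer);
            return buffer;
        }

        int heldBuffers() {
            return held.size();
        }
    }

    /** Resets the buffer, copies one line into it, and decodes it back out. */
    static String readLine(ByteBuffer readBuffer, String line) {
        readBuffer.clear();
        readBuffer.put(line.getBytes(StandardCharsets.UTF_8));
        readBuffer.flip();
        return StandardCharsets.UTF_8.decode(readBuffer).toString();
    }

    /**
     * Simulates scanning fileCount files that each have a header line and one data line.
     * Returns how many buffers are still held when the scan finishes.
     */
    static int heldAfterScan(int fileCount, boolean reuseForHeader) {
        CountingAllocator alloc = new CountingAllocator();
        // The shared pair used for data lines.
        ByteBuffer read = alloc.getManagedBuffer(READ_BUFFER);
        ByteBuffer whiteSpace = alloc.getManagedBuffer(WHITE_SPACE_BUFFER);
        for (int i = 0; i < fileCount; i++) {
            // Either reuse the shared pair for the header, or request a fresh pair per file.
            ByteBuffer headerRead = reuseForHeader ? read : alloc.getManagedBuffer(READ_BUFFER);
            ByteBuffer headerWhiteSpace =
                reuseForHeader ? whiteSpace : alloc.getManagedBuffer(WHITE_SPACE_BUFFER);
            headerWhiteSpace.clear(); // placeholder use; the sketch does no whitespace handling
            String header = readLine(headerRead, "a,b,c");
            String row = readLine(read, "1,2,3");
            if (!header.equals("a,b,c") || !row.equals("1,2,3")) {
                throw new IllegalStateException("buffer reuse corrupted a line");
            }
        }
        return alloc.heldBuffers();
    }

    public static void main(String[] args) {
        System.out.println(heldAfterScan(1000, false)); // 2002: shared pair plus one header pair per file
        System.out.println(heldAfterScan(1000, true));  // 2: the shared pair only
    }
}
```

Because the header line is fully consumed before the first data line is read, clearing and refilling the same buffer loses nothing, which is why a single pair is enough in this sketch.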

This message was sent by Atlassian JIRA
