drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-6147) Limit batch size for Flat Parquet Reader
Date Mon, 12 Feb 2018 00:42:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360192#comment-16360192 ]

Paul Rogers commented on DRILL-6147:

Salim says:

Duplicate Implementation
- I am not contemplating two different implementations: one for Parquet and another for the
rest of the code
- Instead, I am reacting to the fact that we have two different processing patterns: row-oriented
and columnar
- The goal is to offer both strategies depending on the operator

Paul's response:

Drill is columnar. But batches must be collections of rows (all vectors must have the same
row count). How we fill the batch may sometimes be row-wise (as for CSV) and sometimes columnar
(as for Parquet). Even operators such as the SVR (selection vector remover) could be columnar.
That is, in the SVR, we could compress out unwanted rows column-by-column, which is likely much
more CPU-cache friendly than what we do now. The point is: Drill is like photons: it has a
row/column duality and morphs between the two depending on the context.
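The column-by-column compression idea could be sketched roughly as below. This is a hypothetical illustration on plain arrays, not Drill's actual value-vector API: the outer loop walks columns so that reads and writes stay within one contiguous buffer at a time.

```java
// Hypothetical sketch of columnar selection-vector removal: rather than
// copying each surviving row across all columns (row-wise), walk one
// column at a time, which keeps memory access within a single contiguous
// buffer and is friendlier to the CPU cache.
public class ColumnarSvr {

    // Compress out unwanted rows from a batch of integer columns.
    // 'selection' holds the indices of the rows to keep, in order.
    public static int[][] compress(int[][] columns, int[] selection) {
        int[][] out = new int[columns.length][selection.length];
        for (int c = 0; c < columns.length; c++) {        // outer loop: columns
            int[] src = columns[c];
            int[] dst = out[c];
            for (int r = 0; r < selection.length; r++) {  // inner loop: kept rows
                dst[r] = src[selection[r]];
            }
        }
        return out;
    }
}
```

A row-wise version would swap the loops, touching every column's buffer once per row; the columnar form touches each buffer exactly once.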

If we create a separate solution for the columnar read pattern, we must handle the entire
stack: writing to vectors, controlling vector sizes, handling overflow and the rest. Doing
so is, by definition, a separate implementation. It may seem like the new version is simple,
but that is only because you've not yet had the pleasure of working with the complicated use
cases such as deeply nested structures. Trust me: the simple flat case is simple. Beyond that,
things get very complex indeed.
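The "handling overflow" concern above can be made concrete with a small sketch. This is a hypothetical writer (names and sizing are invented for illustration, not Drill's actual vector-writer API): it enforces a per-batch byte budget, and when a row would exceed it, closes the current batch and carries that row into the next one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of batch-size control with overflow handling:
// rows accumulate into a batch until the byte budget would be exceeded,
// at which point the batch is closed and the overflowing row becomes
// the first row of the next batch.
public class BatchWriter {
    private final int byteBudget;
    private int bytesUsed = 0;
    private List<String> current = new ArrayList<>();
    private final List<List<String>> batches = new ArrayList<>();

    public BatchWriter(int byteBudget) {
        this.byteBudget = byteBudget;
    }

    public void writeRow(String row) {
        int size = row.length();            // stand-in for real vector sizing
        if (bytesUsed + size > byteBudget && !current.isEmpty()) {
            batches.add(current);           // close the full batch...
            current = new ArrayList<>();    // ...the overflow row starts the next
            bytesUsed = 0;
        }
        current.add(row);
        bytesUsed += size;
    }

    public List<List<String>> finish() {
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

The hard part Paul alludes to is that in the real reader this bookkeeping must work per-column across many vectors at once, and for nested structures, not just a flat list of rows.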

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.13.0
> The Parquet reader currently uses a hard-coded batch size limit (32k rows) when creating
> scan batches; there is no parameter nor any logic for controlling the amount of memory used.
> This enhancement will allow Drill to take an extra input parameter to control direct memory
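One way the memory-driven limit described in the issue could work is sketched below. This is a hypothetical calculation (the constant and method names are invented, not Drill's actual configuration): the row cap becomes the smaller of the existing 32K limit and however many rows of the estimated width fit in the memory budget.

```java
// Hypothetical sketch of deriving a batch row limit from a memory
// budget instead of relying only on the hard-coded 32K-row cap.
public class BatchSizer {
    static final int MAX_ROWS = 32 * 1024;  // the current hard-coded cap

    // Returns the smaller of the fixed row cap and the number of rows
    // of the estimated width that fit within the memory budget.
    public static int rowLimit(long memoryBudgetBytes, int estRowWidthBytes) {
        long byBudget = memoryBudgetBytes / Math.max(1, estRowWidthBytes);
        return (int) Math.min(MAX_ROWS, Math.max(1, byBudget));
    }
}
```

With a 1 MB budget and 100-byte rows this yields a much smaller batch than 32K rows, while wide budgets fall back to the fixed cap.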

This message was sent by Atlassian JIRA
