drill-issues mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-6147) Limit batch size for Flat Parquet Reader
Date Tue, 13 Feb 2018 18:26:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16362833#comment-16362833 ]

Aman Sinha commented on DRILL-6147:
-----------------------------------

 

Regarding Paul's comment "Said another way, predicate push-down forces row-by-row processing,
even though the underlying storage format is columnar. (This is why the Filter operator works
row-by-row.)"

This is not quite true, even though the Filter operator currently works this way.  As described
in Daniel Abadi's blog on columnar storage formats [1], vectorized processing of filter
conditions yielded a 4x improvement in his (admittedly simple) experiment.  We do want to
keep the option open for such enhancements.
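To make the distinction concrete, here is an illustrative sketch (not Drill code) of evaluating a filter predicate column-at-a-time over a batch, the style Abadi's experiment benchmarked. The tight loop over a primitive array, producing a selection vector of qualifying row indexes, is what lets the JIT emit SIMD-friendly code; all names here are hypothetical.

```java
public class VectorizedFilterSketch {

    // Row-at-a-time: test one value per call (the current Filter style).
    static boolean passes(int value, int threshold) {
        return value > threshold;
    }

    // Column-at-a-time: scan the whole column vector in one tight loop,
    // writing the indexes of qualifying rows into a selection vector.
    static int filterColumn(int[] column, int threshold, int[] selection) {
        int count = 0;
        for (int i = 0; i < column.length; i++) {
            if (column[i] > threshold) {
                selection[count++] = i;
            }
        }
        return count;  // number of rows that passed the predicate
    }

    public static void main(String[] args) {
        int[] column = {5, 42, 7, 100, 1};
        int[] selection = new int[column.length];
        int matched = filterColumn(column, 10, selection);
        System.out.println(matched);                            // 2
        System.out.println(selection[0] + "," + selection[1]);  // 1,3
    }
}
```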

I do agree with Paul on the more general point that we have to be able to handle efficient
access to complex data (arrays, maps, repeated maps) without running into out-of-memory
situations.  It sounds to me that an adaptive algorithm is needed, where the scanner
determines whether to use the 'bulk loading columnar' read where appropriate while still
allowing the 'result set loader row-by-row' read for data with complex types.
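The adaptive choice above could be as simple as inspecting the schema before picking a read path; a minimal sketch, with hypothetical names (not actual Drill APIs):

```java
import java.util.List;

public class AdaptiveReaderSketch {

    // Simplified stand-in for a column's type category.
    enum ColumnKind { PRIMITIVE, ARRAY, MAP, REPEATED_MAP }

    // Bulk columnar read only when every column is a flat primitive;
    // any complex type falls back to the row-by-row result-set-loader path.
    static boolean useBulkColumnarRead(List<ColumnKind> schema) {
        for (ColumnKind kind : schema) {
            if (kind != ColumnKind.PRIMITIVE) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<ColumnKind> flat = List.of(ColumnKind.PRIMITIVE, ColumnKind.PRIMITIVE);
        List<ColumnKind> nested = List.of(ColumnKind.PRIMITIVE, ColumnKind.REPEATED_MAP);
        System.out.println(useBulkColumnarRead(flat));    // true
        System.out.println(useBulkColumnarRead(nested));  // false
    }
}
```

A real implementation would likely decide per column rather than per batch, but the schema inspection is the essential step.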

[1] http://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows) when creating
> scan batches; there is no parameter nor any logic for controlling the amount of memory used.
> This enhancement will allow Drill to take an extra input parameter to control direct memory
> usage.
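The enhancement described in the issue amounts to deriving a row limit from a memory budget rather than always using the fixed 32k-row cap; a minimal sketch, with all names and numbers hypothetical (not Drill's actual implementation):

```java
public class BatchSizeSketch {

    // The fixed cap the issue describes (32k rows).
    static final int HARD_CODED_ROW_LIMIT = 32 * 1024;

    // Derive the batch row limit from a direct-memory budget and an
    // estimated row width, never exceeding the original hard cap.
    static int rowLimit(long memoryBudgetBytes, long estimatedBytesPerRow) {
        long byBudget = memoryBudgetBytes / Math.max(1, estimatedBytesPerRow);
        return (int) Math.min(HARD_CODED_ROW_LIMIT, byBudget);
    }

    public static void main(String[] args) {
        // 16 MB budget at ~1 KB per row caps the batch well below 32k rows.
        System.out.println(rowLimit(16L * 1024 * 1024, 1024));  // 16384
        // A generous budget falls back to the 32k-row hard cap.
        System.out.println(rowLimit(1024L * 1024 * 1024, 128)); // 32768
    }
}
```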



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
