drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
Date Tue, 27 Sep 2016 00:49:20 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524649#comment-15524649 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

GitHub user ppadma opened a pull request:

    https://github.com/apache/drill/pull/597

    DRILL-4905: Push down the LIMIT to the parquet reader scan.

    For a LIMIT N query where N is less than the current default record batchSize (256K rows
    when all columns are fixed-length, 32K otherwise), we still end up reading all 256K/32K
    rows from disk if the rowGroup has that many rows. This causes performance degradation,
    especially when there is a large number of columns.
    This fix addresses the problem by changing the record batchSize that the parquet record
    reader uses, so we don't read more than what is needed.
    It also adds a sys option (store.parquet.record_batch_size) to set the record batch size.
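
    The batch-size adjustment described above can be sketched roughly as follows. This is
    only an illustration, not the code in the pull request: the class name, method name,
    constants, and the convention that a negative limit means "no limit pushed down" are
    all hypothetical. Only the 256K/32K defaults come from the description.

```java
// Hypothetical sketch: cap the reader's record batch size by the pushed-down LIMIT.
// None of these names are taken from the Drill code base.
public final class LimitPushdownSketch {
    // Defaults quoted in the PR description (in rows).
    static final int DEFAULT_FIXED_WIDTH_BATCH = 256 * 1024;
    static final int DEFAULT_VAR_WIDTH_BATCH = 32 * 1024;

    /**
     * Returns the number of rows to read per batch.
     * Assumption: a negative limit means no LIMIT was pushed down.
     */
    static int effectiveBatchSize(long limit, boolean allFixedWidth) {
        int defaultSize = allFixedWidth ? DEFAULT_FIXED_WIDTH_BATCH : DEFAULT_VAR_WIDTH_BATCH;
        if (limit < 0) {
            return defaultSize; // no limit: keep the default batch size
        }
        // Never read more rows per batch than the query asked for.
        return (int) Math.min(limit, defaultSize);
    }

    public static void main(String[] args) {
        System.out.println(effectiveBatchSize(10, true));       // small LIMIT, fixed-width columns
        System.out.println(effectiveBatchSize(100_000, false)); // LIMIT above the 32K default
        System.out.println(effectiveBatchSize(-1, true));       // no LIMIT pushed down
    }
}
```

    With a cap like this, a query such as "select * from <table> limit 10" would ask the
    reader for 10 rows per batch instead of a full 256K/32K batch.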

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ppadma/drill DRILL-4905

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #597
    
----
commit cd665ebdba11f8685ba446f5ec535c81ddd6edc7
Author: Padma Penumarthy <ppenumarthy@ppenumarthy-e653-mpr13.local>
Date:   2016-09-26T17:51:07Z

    DRILL-4905: Push down the LIMIT to the parquet reader scan to limit the numbers of records read

----


> Push down the LIMIT to the parquet reader scan to limit the numbers of records read
> -----------------------------------------------------------------------------------
>
>                 Key: DRILL-4905
>                 URL: https://issues.apache.org/jira/browse/DRILL-4905
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.8.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>             Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to the parquet reader.
> For queries like
> select * from <table> limit N;
> where N < size of the Parquet row group, we read 32K/64K rows or the entire row group.
> This needs to be optimized to read only N rows.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
