drill-issues mailing list archives

From "Khurram Faraaz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4387) Improve execution side when it handles skipAll query
Date Wed, 28 Sep 2016 13:16:20 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529561#comment-15529561 ]

Khurram Faraaz commented on DRILL-4387:

I will track the wrong results issue separately.

As for this JIRA DRILL-4387, the fix is now verified.

With the fix, on Drill 1.6.0 (git commit ID c67d070b), the query takes 60.88 seconds:
0: jdbc:drill:schema=dfs.tmp> SELECT DISTINCT dir1 FROM `DRILL_4589`;
+-------+
| dir1  |
+-------+
| null  |
| Q2    |
| Q1    |
| Q3    |
| Q4    |
+-------+
5 rows selected (60.883 seconds)

Without the fix, the same query takes 106.069 seconds on Drill 1.6.0 at git commit 1d890ff9,
which is one commit before the commit above.

> Improve execution side when it handles skipAll query
> ----------------------------------------------------
>                 Key: DRILL-4387
>                 URL: https://issues.apache.org/jira/browse/DRILL-4387
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>             Fix For: 1.6.0
> DRILL-4279 changes how the planner side and the RecordReader on the execution side
handle a skipAll query. However, there appear to be other places in the codebase that do not
handle a skipAll query efficiently. In particular, GroupScan and ScanBatchCreator
replace a NULL or empty column list with the star column. This forces the execution
side (RecordReader) to fetch all columns from the data source, which adds a large
performance overhead to the SCAN operator.
> To improve Drill's performance, we should change those places as well, as follow-up
work after DRILL-4279.
> One simple example of this problem is:
> {code}
>    SELECT DISTINCT substring(dir1, 5) from  dfs.`/Path/To/ParquetTable`;  
> {code}
> The query does not require any regular column from the Parquet file. However, ParquetRowGroupScan
and ParquetScanBatchCreator put the star column in the column list. If the table has dozens
or hundreds of columns, this makes the SCAN operator much more expensive than necessary.
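
The cost difference described in the issue can be illustrated with a small, self-contained Java sketch. It is not Drill code: the class SkipAllSketch, the helpers widenToStar, keepEmpty and columnsFetched, and the "*" marker are all hypothetical names chosen for illustration, and the "keep it empty" behavior is only an assumption about the direction the issue proposes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SkipAllSketch {
    // Hypothetical stand-in for a star ("*") column marker; not Drill's API.
    static final String STAR = "*";

    // Behavior the issue describes: a null or empty projection is widened to star.
    static List<String> widenToStar(List<String> projected) {
        if (projected == null || projected.isEmpty()) {
            return Collections.singletonList(STAR);
        }
        return projected;
    }

    // Assumed direction of the improvement: an empty projection stays empty,
    // so the reader is not asked for any regular column.
    static List<String> keepEmpty(List<String> projected) {
        if (projected == null) {
            return Collections.emptyList();
        }
        return projected;
    }

    // Toy cost model: how many table columns a reader would have to fetch.
    static int columnsFetched(List<String> projection, List<String> tableColumns) {
        if (projection.contains(STAR)) {
            return tableColumns.size();
        }
        return projection.size();
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("c1", "c2", "c3", "c4");
        List<String> none = Collections.emptyList();
        System.out.println(columnsFetched(widenToStar(none), table)); // prints 4
        System.out.println(columnsFetched(keepEmpty(none), table));   // prints 0
    }
}
```

In this toy model the widened projection reads every column of the table, while the empty projection reads none, which is the overhead the description attributes to the SCAN operator on wide tables.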

This message was sent by Atlassian JIRA
