drill-issues mailing list archives

From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2517) Apply Partition pruning before reading files during planning
Date Mon, 18 Jan 2016 23:56:40 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105993#comment-15105993

Jinfeng Ni commented on DRILL-2517:

Pull request: https://github.com/apache/drill/pull/328/files 

The PR contains the changes from both Adam and Mehant. I added some code changes on top of their

I did a preliminary performance comparison on my Mac laptop. The data set has 115k Parquet files
in total, organized in 25 directories (1990, 1991, ...), each with four subdirectories
(Q1, Q2, Q3, Q4).

For the following query : 
explain plan for select * from t1 where dir0= 1990 and dir1 = 'Q1';

The master branch takes 19.4 seconds; the DRILL-2517 patch takes 8.8 seconds. Both are measured
on the second run, with a warm cache.
1 row selected (19.434 seconds)

1 row selected (8.845 seconds)

The log shows that the time for reading Parquet metadata from file footers is significantly
reduced (from 7388ms to 102ms), due to the pruning effect.

On master branch: 
Fetch parquet metadata: Executed 115544 out of 115544 using 16 threads. Time: 7388ms total, 1.019393ms avg, 745ms max.

With patch:
Fetch parquet metadata: Executed 1111 out of 1111 using 16 threads. Time: 102ms total, 1.053320ms avg, 8ms max.
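
To put the two log lines side by side, here is a small worked calculation in Python. It is only a sketch over the numbers quoted above. The dictionary keys and the function name are made up for illustration, and reading "total" as wall-clock time across the 16 threads is my assumption, not something the log states.

```python
# Figures copied from the two "Fetch parquet metadata" log lines above.
master = {"files": 115544, "total_ms": 7388, "avg_ms": 1.019393, "threads": 16}
patched = {"files": 1111, "total_ms": 102, "avg_ms": 1.053320, "threads": 16}


def summarize(before, after):
    """Compare two metadata-fetch runs (hypothetical helper, not part of Drill)."""
    # Share of the original files whose footers are still read after pruning.
    files_fraction = after["files"] / before["files"]
    # How many times faster the metadata fetch became.
    metadata_speedup = before["total_ms"] / after["total_ms"]
    # Assumption: "total" is wall-clock time, so files * avg / threads
    # should land close to the logged total for the unpruned run.
    estimated_wall_ms = before["files"] * before["avg_ms"] / before["threads"]
    return files_fraction, metadata_speedup, estimated_wall_ms


fraction, speedup, estimated_wall_ms = summarize(master, patched)
print(f"footers read after pruning: {fraction:.2%} of the files")
print(f"metadata fetch speedup: {speedup:.1f}x")
print(f"estimated wall time on master: {estimated_wall_ms:.0f}ms (logged: 7388ms)")
```

With the patch, footers are read for roughly 1% of the files, and the metadata fetch is about 72 times faster. The per-file average is nearly unchanged, so the saving comes from reading fewer footers, not from reading each one faster.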

> Apply Partition pruning before reading files during planning
> ------------------------------------------------------------
>                 Key: DRILL-2517
>                 URL: https://issues.apache.org/jira/browse/DRILL-2517
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Jinfeng Ni
>             Fix For: Future
> Partition pruning still tries to read Parquet files during the planning stage even though
> they don't match the partition filter.
> For example, if there were an invalid Parquet file in a directory that should not be
> {code}
> 0: jdbc:drill:zk=local> select sum(price) from dfs.tmp.purchases where dir0 = 1;
> Query failed: IllegalArgumentException: file:/tmp/purchases/4/0_0_0.parquet is not a Parquet file (too small)
> {code}
> The reason is that the partition pruning happens after the Parquet plugin tries to read
> the footer of each file.
> Ideally, partition pruning would happen first before the format plugin gets involved.
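
The ordering asked for in the issue description above can be sketched in a few lines of Python. This is a rough illustration of the idea only, not Drill's planner code. The function names, the `dir_filters` mapping, and the stand-in footer reader are all hypothetical, as is the path under directory `1` (only the file under directory `4` appears in the issue). Directory names are compared as strings, which is a simplification.

```python
from pathlib import PurePosixPath


def prune_by_directory(files, table_root, dir_filters):
    """Keep only files whose directory levels match dir_filters.

    dir_filters maps a directory depth (0 for dir0, 1 for dir1, ...) to the
    required directory name. Hypothetical helper for illustration only.
    """
    root = PurePosixPath(table_root)
    kept = []
    for f in files:
        # Directory components between the table root and the file name.
        parts = PurePosixPath(f).relative_to(root).parts[:-1]
        if all(level < len(parts) and parts[level] == wanted
               for level, wanted in dir_filters.items()):
            kept.append(f)
    return kept


def read_footers(files, read_footer):
    """Read a footer for every file given; fails if any file is invalid."""
    return {f: read_footer(f) for f in files}


def fake_read_footer(path):
    # Stand-in for a real footer reader: the file under directory 4 is broken.
    if "/4/" in path:
        raise ValueError(f"{path} is not a Parquet file (too small)")
    return {"path": path}


files = [
    "/tmp/purchases/1/0_0_0.parquet",  # hypothetical valid file
    "/tmp/purchases/4/0_0_0.parquet",  # the invalid file from the issue
]

# Prune first (dir0 = 1), then read footers only for the surviving files.
survivors = prune_by_directory(files, "/tmp/purchases", {0: "1"})
footers = read_footers(survivors, fake_read_footer)
print(sorted(footers))
```

Because the pruning runs before any footer is read, the broken file under directory `4` is never opened. Calling `read_footers` on the full list instead would raise the error, which mirrors the failure shown in the issue.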
