drill-issues mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2287) Filesystem partitioning is slow
Date Wed, 25 Feb 2015 00:29:05 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335723#comment-14335723 ]

Aman Sinha commented on DRILL-2287:
-----------------------------------

For a simple count(*) query against Parquet files, Drill uses the row count from the metadata,
which is why it is faster.  This is expected behavior.  I don't think this has anything to do
with 'filesystem partitioning' as stated in the summary.

> Filesystem partitioning is slow
> -------------------------------
>
>                 Key: DRILL-2287
>                 URL: https://issues.apache.org/jira/browse/DRILL-2287
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Jinfeng Ni
>            Priority: Minor
>
> We have created a number of Parquet files in different directories (e.g. 1, 2, 3, 4)
> to partition our data on the filesystem.
> Assuming we only have 4 directories (1, 2, 3 and 4), when executing a query like:
> {code:sql}
> select count(*) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4)
> {code}
> The query is significantly slower than:
> {code:sql}
> select count(*) from dfs.tmp.mydata
> {code}
> Looking at the physical plans, it looks like even if dir0 is only in the WHERE clause,
> it'll emit that from the scan, which then needs an extra step (a projection) to only project
> through the count (removing dir0).  This appears to be the cause of the slowdown.
> To make it even more confusing, if you only select the LAST directory (i.e. in this case,
> 4), then it has a different physical plan again and seems to use a union-exchange.
> Ultimately, the query planner should realise that dir0 is not projected and then once
> the pushdown filesystem filtering is done, remove dir0 from being emitted from the scan and
> not require a project.

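The plan differences described in the report can be inspected with Drill's EXPLAIN PLAN FOR
statement. A minimal sketch, assuming the dfs.tmp.mydata table and the four directories named
in the report; the resulting plan text is not reproduced here and is likely to vary by version:

{code:sql}
-- Plan for the filtered query (the slower one in the report)
explain plan for select count(*) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4);

-- Plan for the unfiltered query (the faster one in the report)
explain plan for select count(*) from dfs.tmp.mydata;
{code}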


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
