drill-issues mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4530) Improve metadata cache performance for queries with single partition
Date Thu, 31 Mar 2016 18:49:25 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220436#comment-15220436 ]

Aman Sinha commented on DRILL-4530:

To address this in a reasonably efficient way, one option I am exploring is to keep
a separate ".parquet_directories" file similar to the Parquet metadata cache file. The
directories file is a copy of only the "directories" field of the metadata cache (the original
metadata cache file remains intact). The motivation is that the list of directories
is much smaller, so I can apply partition pruning on the directories first; this allows
optimizations such as detecting a single partition and reading the smaller metadata cache file
from that partition. [~jnadeau], have you or others explored keeping the directories list
separately? I am not proposing to break up the existing cache file (at least not for now;
I am aware from discussion with [~parthc] that doing so could break backward compatibility).
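
To make the idea concrete, here is a rough sketch of directory-first pruning, not the actual
Drill implementation. It assumes the "directories" list holds paths relative to the table root,
that each subdirectory has its own metadata cache file, and that the cache file is named
".drill.parquet_metadata"; the function names and the filter representation are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed cache file name; the comment above only names ".parquet_directories".
METADATA_CACHE_FILE = ".drill.parquet_metadata"


def prune_directories(directories, dir_filters):
    """Keep the directories whose path components satisfy every dirN filter.

    directories: paths relative to the table root, e.g. "B/C" (assumption).
    dir_filters: {level: value}, e.g. {0: "B", 1: "C"} for
                 dir0 = 'B' AND dir1 = 'C' (hypothetical representation).
    """
    kept = []
    for d in directories:
        parts = PurePosixPath(d).parts
        if all(level < len(parts) and parts[level] == value
               for level, value in dir_filters.items()):
            kept.append(d)
    return kept


def choose_cache_file(root, directories, dir_filters):
    """Pick the metadata cache file to read after pruning on directories only."""
    survivors = prune_directories(directories, dir_filters)
    # Compare survivors by their prefix up to the deepest filtered level.
    depth = max(dir_filters) + 1 if dir_filters else 0
    prefixes = {PurePosixPath(*PurePosixPath(d).parts[:depth]) for d in survivors}
    if dir_filters and len(prefixes) == 1:
        # Single partition: read the smaller cache file inside it.
        return str(PurePosixPath(root) / prefixes.pop() / METADATA_CACHE_FILE)
    # Otherwise fall back to the cache file at the table root.
    return str(PurePosixPath(root) / METADATA_CACHE_FILE)


directories = ["B", "B/C", "B/D", "E", "E/F"]
print(choose_cache_file("A", directories, {0: "B", 1: "C"}))
# prints A/B/C/.drill.parquet_metadata
```

With this shape, query 2 from the description would resolve to the same subdirectory-level
cache file that query 1 reads, and only the small directories list is scanned at the root.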

> Improve metadata cache performance for queries with single partition 
> ---------------------------------------------------------------------
>                 Key: DRILL-4530
>                 URL: https://issues.apache.org/jira/browse/DRILL-4530
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 1.6.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.7.0
> Consider two types of queries which are run with Parquet metadata caching: 
> {noformat}
> query 1:
> SELECT col FROM  `A/B/C`;
> query 2:
> SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 = 'C';
> {noformat}
> For a certain dataset, the elapsed time of query 1 is 1 sec whereas that of query 2 is
> 9 sec, even though both access the same amount of data. The user expectation is that
> they should perform roughly the same. The main difference comes from query 2 reading the
> bigger metadata cache file at the root level 'A' and then applying the partitioning filter,
> whereas query 1 reads a much smaller metadata cache file at the subdirectory level.

