drill-dev mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (DRILL-3918) Avoid extra loading of the metadata cache file
Date Sun, 11 Oct 2015 19:48:05 GMT

     [ https://issues.apache.org/jira/browse/DRILL-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha resolved DRILL-3918.
       Resolution: Fixed
    Fix Version/s: 1.2.0

Fixed in b4d47c56b.  
Performance numbers indicate that this fix reduces the elapsed time of an 'Explain select
count(*) from table' query, where the table consists of 400K files, by about 100 seconds.
More work is needed to improve cache performance; that can be targeted for a later release.

> Avoid extra loading of the metadata cache file
> ----------------------------------------------
>                 Key: DRILL-3918
>                 URL: https://issues.apache.org/jira/browse/DRILL-3918
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.2.0
> The metadata cache file is currently being deserialized and read twice: once during {{ParquetFormatPlugin.expandSelection()}}
> that happens as part of the creation of DynamicDrillTable and once during ParquetGroupScan.
>  This was also pointed out by [~sphillips] in DRILL-3901.   We should avoid doing the read
> The performance issue is getting exposed more now because of the fix for DRILL-3917 which
> fixed the behavior of expandSelection() by reading the metadata cache file through the correct
> interface (it was previously erroring out and not spending any time in the expansion). This
> fix is needed for correct functionality.   However, performance numbers show a slowdown of
> about 2.7x for the 400K files test using caching.  In my view, this performance comparison
> is not very meaningful because of the prior bug.
> This JIRA specifically targets the extra load of the metadata cache file.  There
> are other opportunities for improvement (for instance, reading from the metadata cache is
> single-threaded whereas reading from Parquet files gets parallelized.  That should be a separate
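
The double read described in the issue can be pictured with a small sketch. This is
not Drill's code: the class MetadataCache, the two helper functions, and the JSON
layout ("files", "path", "rowCount") are all hypothetical stand-ins, chosen only to
show two planning steps sharing one deserialized copy of a cache file instead of
each parsing it again.

```python
import json


class MetadataCache:
    """Hypothetical holder that deserializes a given cache file at most once."""

    def __init__(self):
        self._loaded = {}    # path -> already-parsed metadata
        self.load_count = 0  # number of real deserializations, for illustration

    def get(self, path):
        key = str(path)
        if key not in self._loaded:
            # Only the first caller pays the cost of reading and parsing.
            with open(key, "r", encoding="utf-8") as f:
                self._loaded[key] = json.load(f)
            self.load_count += 1
        return self._loaded[key]


def expand_selection(cache, path):
    """Stand-in for the step that expands a table into its file list."""
    return [entry["path"] for entry in cache.get(path)["files"]]


def build_group_scan(cache, path):
    """Stand-in for the scan step that needs the same metadata again."""
    return sum(entry["rowCount"] for entry in cache.get(path)["files"])
```

Calling expand_selection() and then build_group_scan() with the same MetadataCache
instance parses the file once, and load_count stays at 1. A real implementation
would also have to decide when a cached copy is stale, which this sketch ignores.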

This message was sent by Atlassian JIRA
