drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3918) Avoid extra loading of the metadata cache file
Date Sun, 11 Oct 2015 18:18:05 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952372#comment-14952372 ]

ASF GitHub Bot commented on DRILL-3918:

Github user jacques-n commented on a diff in the pull request:

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java
    @@ -44,6 +45,10 @@
       public List<String> files;
       public String selectionRoot;
    +  // this is a temporary location for the reference to Parquet metadata
    +  // TODO: ideally this should be in a Parquet specific derived class.
    +  public ParquetTableMetadata_v1 parquetMeta = null;
    --- End diff --
    Can we please make this a private field and add the appropriate getter before merging?
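
    For illustration, the requested change could look roughly like the sketch below. It is not the code from the pull request: `FileSelectionSketch` and `ParquetTableMetadataStub` are hypothetical stand-ins for `FileSelection` and `ParquetTableMetadata_v1` so that the example compiles on its own, and the accessor names (including the setter, which the comment does not ask for) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Drill's ParquetTableMetadata_v1 (assumption, not the real class).
class ParquetTableMetadataStub {
  final String cacheFilePath;

  ParquetTableMetadataStub(String cacheFilePath) {
    this.cacheFilePath = cacheFilePath;
  }
}

// Simplified sketch of FileSelection; only the fields visible in the diff are modeled.
class FileSelectionSketch {
  public List<String> files = new ArrayList<>();
  public String selectionRoot;

  // Temporary location for the reference to Parquet metadata.
  // Private rather than public, as requested in the review.
  private ParquetTableMetadataStub parquetMeta = null;

  // Getter requested by the reviewer (name is an assumption).
  public ParquetTableMetadataStub getParquetMetadata() {
    return parquetMeta;
  }

  // Setter added here only so callers can still populate the field (assumption).
  public void setParquetMetadata(ParquetTableMetadataStub parquetMeta) {
    this.parquetMeta = parquetMeta;
  }
}
```

    With this shape, a caller that has already deserialized the metadata cache file can store the result on the selection, and a later consumer can check the getter for a non-null value before reading the file again.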

> Avoid extra loading of the metadata cache file
> ----------------------------------------------
>                 Key: DRILL-3918
>                 URL: https://issues.apache.org/jira/browse/DRILL-3918
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
> The metadata cache file is currently being deserialized and read twice: once during {{ParquetFormatPlugin.expandSelection()}}
that happens as part of the creation of DynamicDrillTable and once during ParquetGroupScan.
 This was also pointed out by [~sphillips] in DRILL-3901.   We should avoid doing the read
> The performance issue is getting exposed more now because of the fix for DRILL-3917 which
fixed the behavior of expandSelection() by reading the metadata cache file through the correct
interface (it was previously erroring out and not spending any time in the expansion). This
fix is needed for correct functionality.   However, performance numbers show a slowdown of
about 2.7x for the 400K files test using caching.  In my view, this performance comparison
is not very meaningful because of the prior bug.  
> This JIRA specifically targets the extra load of the metadata cache file.  There
are other opportunities for improvement (for instance reading from the metadata cache is single
threaded whereas reading from parquet files gets parallelized.  That should be a separate

This message was sent by Atlassian JIRA
