drill-dev mailing list archives

From "Steven Phillips (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-2743) Parquet file metadata caching
Date Fri, 10 Apr 2015 00:22:14 GMT
Steven Phillips created DRILL-2743:

             Summary: Parquet file metadata caching
                 Key: DRILL-2743
                 URL: https://issues.apache.org/jira/browse/DRILL-2743
             Project: Apache Drill
          Issue Type: New Feature
          Components: Storage - Parquet
            Reporter: Steven Phillips
            Assignee: Steven Phillips

To run a query against Parquet files, we first have to recursively search the directory tree
for all of the files, get the block locations for each file, and read the footer from each
file. All of this happens during the planning phase. When there are many files, this can
delay the query significantly, and it does not scale.

However, there isn't really any need to read the footers during planning. If we instead treat
each Parquet file as a single work unit, all we need to know are the block locations for the
file, the number of rows, and the columns. We should store only the information we need
for planning in a file located in the top directory of a given Parquet table. We can then
delay reading the footers until execution time, where it can be done in parallel.
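
As a rough illustration of the idea (not Drill's implementation, which would be in Java), the
sketch below writes one small cache file at the table root and reads it back at planning time.
The cache file name, the JSON layout, the FileSummary type, and the caller-supplied summarize
function are all hypothetical. How the row count, columns, and block locations are actually
obtained is left to the caller.

```python
import json
import os
from dataclasses import dataclass, asdict
from typing import Callable, List

# Hypothetical file name chosen for this sketch; the issue does not specify one.
CACHE_FILE_NAME = ".parquet_metadata_cache.json"


@dataclass
class FileSummary:
    """The per-file facts the issue says planning needs (hypothetical layout)."""
    path: str                      # file path, as reported by the summarizer
    row_count: int                 # number of rows in the file
    columns: List[str]             # column names
    block_hosts: List[List[str]]   # one list of host names per block


def build_cache(table_root: str,
                summarize: Callable[[str], FileSummary]) -> str:
    """Walk the table once, summarize each .parquet file, write one cache file.

    `summarize` is supplied by the caller and is where footers and block
    locations would be read; this sketch does not read Parquet itself.
    """
    summaries = []
    for dirpath, dirs, files in os.walk(table_root):
        dirs.sort()  # make traversal order deterministic
        for name in sorted(files):
            if name.endswith(".parquet"):
                full_path = os.path.join(dirpath, name)
                summaries.append(asdict(summarize(full_path)))
    cache_path = os.path.join(table_root, CACHE_FILE_NAME)
    with open(cache_path, "w") as f:
        json.dump({"files": summaries}, f)
    return cache_path


def load_work_units(table_root: str) -> List[FileSummary]:
    """Planning-time path: a single small read, with no footer access."""
    with open(os.path.join(table_root, CACHE_FILE_NAME)) as f:
        return [FileSummary(**entry) for entry in json.load(f)["files"]]
```

With something like this, the expensive walk happens once in build_cache, and the planner only
calls load_work_units. Keeping the cache in sync when files are added or removed is not
addressed here.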

This message was sent by Atlassian JIRA
