drill-issues mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-2743) Parquet file metadata caching
Date Fri, 21 Aug 2015 16:58:45 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707029#comment-14707029 ]

Aman Sinha edited comment on DRILL-2743 at 8/21/15 4:58 PM:
------------------------------------------------------------

I added specific comments in the pull request. A few broader review comments below:
 1. The first query that does not find the metadata file will gather the metadata,
    so the elapsed time of the first query can be much longer than that of a
    subsequent query. We should probably document that.
 2. How do we prevent concurrent REFRESH METADATA operations on the
    same table or subdirectory, since they will be updating the same file?
 3. I agree with Rahul's comment on the log message and Jacques's comment on
    versioning; these will be very useful. I also foresee that people might want to
    point to a specific metadata file rather than be restricted to the hardcoded name.
    This could be an enhancement.
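
One common way to address the concern in point 2 is sketched below. This is not taken from the patch and is not Drill's implementation: it assumes a local POSIX-style file system, and the function name and the file name `metadata_cache.json` are hypothetical. Each writer serializes to its own temporary file in the table directory and then renames it over the target, so a reader sees either the old file or the new one, never a partially written one.

```python
import json
import os
import tempfile


def write_metadata_atomically(table_dir, metadata, name="metadata_cache.json"):
    """Write metadata to table_dir/name without exposing a partial file.

    The name and the JSON layout are illustrative assumptions.
    """
    target = os.path.join(table_dir, name)
    # The temp file lives in the same directory so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=table_dir, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metadata, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, target)  # atomic replacement on POSIX file systems
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target
```

With this approach two concurrent refreshes still both do the work and the last rename wins; it only guarantees that the file is never seen half written. Avoiding the duplicated work would need an additional lock, and rename semantics on a distributed file system may differ.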
  


> Parquet file metadata caching
> -----------------------------
>
>                 Key: DRILL-2743
>                 URL: https://issues.apache.org/jira/browse/DRILL-2743
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Storage - Parquet
>            Reporter: Steven Phillips
>            Assignee: Aman Sinha
>             Fix For: 1.2.0
>
>         Attachments: DRILL-2743.patch, drill.parquet_metadata
>
>
> To run a query against Parquet files, we first have to recursively search the directory
> tree for all of the files, get the block locations for each file, and read the footer from
> each file. All of this is done during the planning phase. When there are many files, this can
> result in a very large delay in running the query, and it does not scale.
>
> However, there isn't really any need to read the footers during planning. If we instead
> treat each Parquet file as a single work unit, all we need to know are the block locations
> for the file, the number of rows, and the columns. We should store only the information
> needed for planning in a file located in the top directory of a given Parquet table, and
> then we can delay reading the footers until execution time, which can be done in parallel.
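
The idea in the description can be illustrated with a minimal sketch. It is not the attached patch: the cache file name, the JSON layout, and the `read_footer` callback are all hypothetical, and block locations are left out because they depend on the file system. The table directory is walked once, only the per-file facts needed for planning are kept, and later planning reads the single cache file instead of every footer.

```python
import json
import os

CACHE_NAME = ".metadata_cache.json"  # hypothetical name, not Drill's actual file name


def build_cache(table_dir, read_footer):
    """Walk table_dir once and record only what planning needs per file.

    read_footer is a caller-supplied function (an assumption of this sketch)
    that returns (row_count, column_names) for one Parquet file path.
    """
    entries = []
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            if not name.endswith(".parquet"):
                continue
            path = os.path.join(root, name)
            row_count, columns = read_footer(path)
            entries.append({
                "path": os.path.relpath(path, table_dir),
                "rowCount": row_count,
                "columns": list(columns),
            })
    entries.sort(key=lambda e: e["path"])
    cache_path = os.path.join(table_dir, CACHE_NAME)
    with open(cache_path, "w") as f:
        json.dump({"version": 1, "files": entries}, f)
    return cache_path


def load_cache(table_dir):
    """Return the cached per-file entries, or None if no cache file exists yet."""
    cache_path = os.path.join(table_dir, CACHE_NAME)
    if not os.path.exists(cache_path):
        return None
    with open(cache_path) as f:
        return json.load(f)["files"]
```

A planner built along these lines would call `load_cache` first and fall back to `build_cache` only when it returns `None`, which is why the first query against a table takes longer than later ones.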



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
