drill-issues mailing list archives

From "Steven Phillips (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2743) Parquet file metadata caching
Date Thu, 20 Aug 2015 00:11:46 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704027#comment-14704027 ]

Steven Phillips commented on DRILL-2743:

1. Currently, there is no log message, but I could add one.
2. I am not sure what you mean by "change anything", but both cases, files and directories,
are handled.
3. I don't expect the format to change in the short term, but I can't guarantee that. I do expect
there to be changes to the format in future releases.
4. Those permissions will allow anyone to read the file. I do see a potential problem, though.
Currently, if a change is detected to the underlying files, the metadata is updated automatically
when a query is run. If the user doesn't have write permission, this will cause a failure.
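
The refresh behaviour described in point 4 can be sketched roughly as follows. This is only an illustration in Python, not Drill's code: the cache file name, the `scan` callback, the modification-time comparison, and the handling of `PermissionError` are all assumptions made for the example. In particular, the fallback (use the rescanned metadata without persisting it) is just one possible way to avoid the failure mentioned above, not something decided in this issue.

```python
import json
import os

CACHE_NAME = "metadata_cache.json"  # hypothetical file name, not the one Drill uses


def cache_is_stale(table_dir, cache_path):
    """Return True if any file under table_dir is newer than the cache file.

    Assumption: staleness is judged by modification time only, so deleted
    files are not detected by this simplified check.
    """
    cache_mtime = os.path.getmtime(cache_path)
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            path = os.path.join(root, name)
            if path == cache_path:
                continue  # the cache file itself does not count as a change
            if os.path.getmtime(path) > cache_mtime:
                return True
    return False


def load_metadata(table_dir, scan, cache_name=CACHE_NAME):
    """Return (metadata, source) for a table directory.

    `scan` is a hypothetical callback that rebuilds the metadata dict by
    inspecting the data files. The cache is refreshed automatically when it
    is missing or stale, which requires write permission on table_dir.
    """
    cache_path = os.path.join(table_dir, cache_name)
    if os.path.exists(cache_path) and not cache_is_stale(table_dir, cache_path):
        with open(cache_path) as f:
            return json.load(f), "cache"

    metadata = scan(table_dir)
    try:
        with open(cache_path, "w") as f:
            json.dump(metadata, f)
    except PermissionError:
        # A read-only user cannot update the cache; fall back to the freshly
        # scanned metadata instead of failing the query (assumed behaviour).
        return metadata, "rescanned-not-persisted"
    return metadata, "rescanned"
```

With this shape, a user who can only read the table directory still gets an answer, but pays the full scan cost on every query until someone with write access refreshes the cache.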

> Parquet file metadata caching
> -----------------------------
>                 Key: DRILL-2743
>                 URL: https://issues.apache.org/jira/browse/DRILL-2743
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Storage - Parquet
>            Reporter: Steven Phillips
>            Assignee: Aman Sinha
>             Fix For: 1.2.0
>         Attachments: DRILL-2743.patch, drill.parquet_metadata
> To run a query against parquet files, we first have to recursively search the directory
> tree for all of the files, get the block locations for each file, and read the footer from
> each file, all during the planning phase. When there are many files, this can
> result in a very large delay in running the query, and it does not scale.
> However, there isn't really any need to read the footers during planning. If we instead
> treat each parquet file as a single work unit, all we need to know are the block locations
> for the file, the number of rows, and the columns. We should store only the information
> needed for planning in a file located in the top directory for a given parquet table, and
> then we can delay reading the footers until execution time, which can be done in parallel.
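
As a rough illustration of the proposal quoted above, the following Python sketch writes one summary file at the top of a table directory and then plans work units from that file alone. It is not the attached patch: the `FileSummary` fields, the JSON layout, the `table_summary.json` name, and the `read_planning_info` callback (which stands in for whatever reads row counts, columns, and block locations) are all hypothetical.

```python
import json
import os
from dataclasses import asdict, dataclass
from typing import Callable, List, Tuple


@dataclass
class FileSummary:
    """Planning-time facts for one parquet file (hypothetical layout)."""
    path: str               # path relative to the table's top directory
    row_count: int
    columns: List[str]
    block_hosts: List[str]  # hosts holding the file's blocks


def build_table_summary(
    table_dir: str,
    read_planning_info: Callable[[str], Tuple[int, List[str], List[str]]],
    out_name: str = "table_summary.json",  # hypothetical file name
) -> str:
    """Walk table_dir once and store per-file planning info in its top directory."""
    entries = []
    for root, dirs, files in os.walk(table_dir):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            if not name.endswith(".parquet"):
                continue
            full = os.path.join(root, name)
            rows, columns, hosts = read_planning_info(full)
            rel = os.path.relpath(full, table_dir)
            entries.append(asdict(FileSummary(rel, rows, columns, hosts)))
    out_path = os.path.join(table_dir, out_name)
    with open(out_path, "w") as f:
        json.dump({"files": entries}, f, indent=2)
    return out_path


def plan_work_units(summary_path: str) -> List[Tuple[str, int]]:
    """Plan one work unit per file using only the summary, without opening any footer."""
    with open(summary_path) as f:
        summary = json.load(f)
    return [(e["path"], e["row_count"]) for e in summary["files"]]
```

The point of the split is that `plan_work_units` touches a single small file regardless of how many parquet files the table has; the footers themselves would only be read later, by whichever workers execute each unit.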
