drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4053) Reduce metadata cache file size
Date Fri, 13 Nov 2015 23:31:11 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004909#comment-15004909 ]

ASF GitHub Bot commented on DRILL-4053:

Github user jacques-n commented on the pull request:

    You're currently using an alternative file name for this. I think it would be better if
we use the version field and continue to use the same file name. I assume we'll have many
versions of this file. Also, what should users expect to happen if they query a directory
that still has an old-version file? Can we maintain multiple classes and dispatch on version?
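
To make the "dispatch on version" suggestion concrete, here is a minimal sketch in Python (not Drill's actual Java code). The field name `metadata_version`, the two layouts, and all function names are hypothetical, chosen only for illustration. It also assumes a file with no version field should be read as the oldest layout.

```python
import json


def _read_v1(doc):
    # Hypothetical v1 layout: every file entry repeats its own column list.
    return [(f["path"], list(f["columns"])) for f in doc["files"]]


def _read_v2(doc):
    # Hypothetical v2 layout: one shared column list at the top level.
    shared = doc["columns"]
    return [(f["path"], list(shared)) for f in doc["files"]]


# One reader per known version; the file name stays the same for all of them.
READERS = {"v1": _read_v1, "v2": _read_v2}


def load_metadata(text):
    """Parse a cache file and dispatch to the reader for its version."""
    doc = json.loads(text)
    # Assumption: a missing version field means the oldest layout.
    version = doc.get("metadata_version", "v1")
    reader = READERS.get(version)
    if reader is None:
        raise ValueError("unsupported metadata cache version: %r" % version)
    return reader(doc)
```

With this shape, a directory holding an old file is still readable, and an unknown version fails loudly, at which point the caller could choose to regenerate the cache.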

> Reduce metadata cache file size
> -------------------------------
>                 Key: DRILL-4053
>                 URL: https://issues.apache.org/jira/browse/DRILL-4053
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Metadata
>    Affects Versions: 1.3.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.4.0
> The parquet metadata cache file has a fair amount of redundant metadata that causes the
> size of the cache file to bloat. Two things that we can reduce are:
> 1) Schema is repeated for every row group. We can keep a merged schema (similar to what
> was discussed for insert into functionality).
> 2) The max and min values in the stats are used for partition pruning when the values are
> the same. We can keep only the maxValue, and that too only if it is the same as the minValue.
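
The two reductions described in the issue can be sketched roughly as follows. This is a Python illustration, not the Drill implementation. The dictionary layout, the key names (`schema`, `stats`, `max_values`), and the choice to raise on conflicting column types are all assumptions made for the example.

```python
def merge_schema(row_groups):
    """Collect one schema for the whole file instead of one per row group."""
    merged = {}
    for rg in row_groups:
        for name, col_type in rg["schema"].items():
            # Assumption: conflicting types are an error in this sketch.
            if name in merged and merged[name] != col_type:
                raise ValueError("conflicting types for column %r" % name)
            merged[name] = col_type
    return merged


def compact_stats(stats):
    """Keep only the max value, and only when min and max are equal."""
    return {col: hi for col, (lo, hi) in stats.items() if lo == hi}


def compact(row_groups):
    """Build the reduced metadata: merged schema plus per-row-group max values."""
    return {
        "schema": merge_schema(row_groups),
        "row_groups": [
            {"max_values": compact_stats(rg["stats"])} for rg in row_groups
        ],
    }
```

A column whose min and max differ within a row group drops out of the stats entirely, which matches the issue's point that the stats are only used for pruning when the two values are the same.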

This message was sent by Atlassian JIRA
