drill-issues mailing list archives

From "Jason Altekruse (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4070) Metadata Caching : min/max values are null for varchar columns in auto partitioned data
Date Fri, 13 Nov 2015 18:21:11 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004436#comment-15004436 ]

Jason Altekruse commented on DRILL-4070:

I clarified later that the fixes I was considering carry a risk either way. If we had this
behavior by default, it could produce wrong results on foreign Parquet files. If we put in a
switch, users could lose correctness by not realizing the switch is on. No one weighed in on
the proposed solutions here or on the vote thread.

I would advocate for the migration tool by itself, because the issue with non-migrated files
is performance, not correctness, and migration should be simple. It does not require rewriting
data, only metadata, and it would work with HDFS because it only involves appending a new
footer to the file. The parts of the file that are already written do not need to change.
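
To illustrate why an append-only migration is enough, here is a minimal sketch in Python. It relies on the Parquet file layout, in which a file ends with the footer metadata, a 4-byte little-endian footer length, and the magic bytes "PAR1", so a reader locates the footer from the end of the file. The function names are hypothetical, and the footer is treated as opaque bytes; a real footer is a Thrift-serialized FileMetaData structure, and a real tool would have to build it correctly.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes that end every Parquet file


def append_footer(path, new_footer: bytes) -> None:
    """Append a replacement footer without touching the bytes already written.

    Hypothetical helper: new_footer stands in for serialized file metadata.
    """
    with open(path, "ab") as f:  # append-only, which HDFS also supports
        f.write(new_footer)
        f.write(struct.pack("<I", len(new_footer)))  # 4-byte little-endian length
        f.write(MAGIC)


def read_last_footer(path) -> bytes:
    """Locate the footer the way a reader does: starting from the end of the file."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        length_bytes = f.read(4)
        if f.read(4) != MAGIC:
            raise ValueError("file does not end with the Parquet magic bytes")
        (length,) = struct.unpack("<I", length_bytes)
        f.seek(-(8 + length), os.SEEK_END)
        return f.read(length)
```

Because the reader starts from the end of the file, the old footer simply becomes unused bytes, and the offsets recorded for the existing data stay valid since none of it moves.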

If someone wants to work on the switch they can. Unfortunately the code that is reading the
metadata and deciding to ignore the statistics is in the Parquet library itself. So we need
to make a case that this behavior should be shared across all tools, or to add a public API
to allow consumers of the library to provide their own version filter to override the default.
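
As a rough illustration of the second option, the sketch below shows what a consumer-supplied version filter could look like. Everything in it is an assumption made for illustration: the function names, the "created by" string format, the cutoff version, and the "example-fork" writer are hypothetical and are not the Parquet library's actual API or rules.

```python
import re
from typing import Callable, Optional

# A filter takes (created_by, column_type) and returns True when the
# min/max statistics for that column should be ignored.
VersionFilter = Callable[[str, str], bool]

# Assumed "created by" format: "<application> version <x.y.z> ...".
_CREATED_BY = re.compile(r"^(?P<app>\S+) version (?P<ver>\d+(?:\.\d+)*)")

# Placeholder cutoff, chosen only for this example.
PLACEHOLDER_CUTOFF = (1, 8, 0)


def default_filter(created_by: str, column_type: str) -> bool:
    """Hypothetical library default: distrust binary statistics from old writers."""
    if column_type != "BINARY":
        return False
    match = _CREATED_BY.match(created_by or "")
    if match is None:
        return True  # unknown writer: be conservative and ignore the statistics
    version = tuple(int(part) for part in match.group("ver").split("."))
    return version < PLACEHOLDER_CUTOFF


def should_ignore_statistics(
    created_by: str, column_type: str, override: Optional[VersionFilter] = None
) -> bool:
    """Use the consumer's filter when one is supplied, else the default."""
    chosen = override if override is not None else default_filter
    return chosen(created_by, column_type)


def trust_example_fork(created_by: str, column_type: str) -> bool:
    """Consumer-side filter: trust files written by a known fork."""
    if created_by.startswith("example-fork "):
        return False
    return default_filter(created_by, column_type)
```

With a hook like this, a tool that knows its own fork wrote correct statistics could pass `trust_example_fork` as the override, while every other consumer keeps the conservative default.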

I think overall this is just an unfortunate cost of us maintaining a fork for so long. I don't
want users to have to use a migration tool, but I think it is the safest solution.

> Metadata Caching : min/max values are null for varchar columns in auto partitioned data
> ---------------------------------------------------------------------------------------
>                 Key: DRILL-4070
>                 URL: https://issues.apache.org/jira/browse/DRILL-4070
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.3.0
>            Reporter: Rahul Challapalli
>            Priority: Blocker
>             Fix For: 1.3.0
>         Attachments: cache.txt, fewtypes_varcharpartition.tar.tgz
> git.commit.id.abbrev=e78e286
> The metadata cache file created contains incorrect values for min/max fields for varchar
> columns. The data is also partitioned on the varchar column.
> {code}
> refresh table metadata fewtypes_varcharpartition;
> {code}
> As a result, partition pruning is not happening. This was working after DRILL-3937 was
> fixed (d331330efd27dbb8922024c4a18c11e76a00016b).
> I attached the data set and the cache file.

This message was sent by Atlassian JIRA
