drill-issues mailing list archives

From "Jason Altekruse (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4070) Metadata Caching : min/max values are null for varchar columns in auto partitioned data
Date Thu, 12 Nov 2015 16:56:11 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002403#comment-15002403 ]

Jason Altekruse commented on DRILL-4070:
----------------------------------------

Yeah, I can take a look. I think this is caused by one of the few commits we picked up from
parquet-mr master: https://github.com/apache/parquet-mr/commit/e3b95020f777eb5e0651977f654c1662e3ea1f29

We had fixed this issue in our fork, but upstream ended up doing more to preserve existing
behaviors among the object models. As part of that change, it looks like the reader now ignores
the statistics on binary columns for older files, based on the version number listed in the
footer. I think we may just be neglecting to write a proper version number, so the reader
assumes it cannot trust the statistics values.
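
To make the suspected behavior concrete, here is a minimal sketch of the kind of version check described above. It is not the parquet-mr code: the function name, the cutoff version, and the assumed shape of the footer's writer string are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical cutoff: writer versions below this are treated as having
# unreliable min/max statistics on binary columns (assumption for illustration).
FIXED_IN = (1, 8, 0)

# Assumed shape of the writer string stored in the footer, e.g.
# "parquet-mr version 1.8.1 (build abc123)".
CREATED_BY = re.compile(r"^\S+ version (\d+)\.(\d+)\.(\d+)")

BINARY_TYPES = {"BINARY", "FIXED_LEN_BYTE_ARRAY"}


def should_ignore_statistics(created_by, column_type):
    """Return True when min/max statistics should not be trusted."""
    # Only binary-like columns are affected; other types are always trusted.
    if column_type not in BINARY_TYPES:
        return False
    # A missing writer string gives no evidence the file is new enough.
    if not created_by:
        return True
    match = CREATED_BY.match(created_by)
    # An unparseable writer string is treated the same as a missing one.
    if match is None:
        return True
    version = tuple(int(part) for part in match.groups())
    return version < FIXED_IN
```

Under these assumptions, a file whose footer carries no parseable version has its varchar min/max discarded, which would be consistent with the null values in the metadata cache.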

I'll take a look to confirm and try to post a patch that will fix it for newly written files
at least. If you need this to work with old files, we may need to write a small utility to
rewrite their footers and append new ones to the end of the files, or else require the data
to be rewritten, but we will obviously try to avoid that.

> Metadata Caching : min/max values are null for varchar columns in auto partitioned data
> ---------------------------------------------------------------------------------------
>
>                 Key: DRILL-4070
>                 URL: https://issues.apache.org/jira/browse/DRILL-4070
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.3.0
>            Reporter: Rahul Challapalli
>            Priority: Critical
>         Attachments: cache.txt, fewtypes_varcharpartition.tar.tgz
>
>
> git.commit.id.abbrev=e78e286
> The metadata cache file created contains incorrect values for min/max fields for varchar
> columns. The data is also partitioned on the varchar column.
> {code}
> refresh table metadata fewtypes_varcharpartition;
> {code}
> As a result, partition pruning is not happening. This was working after DRILL-3937 was
> fixed (d331330efd27dbb8922024c4a18c11e76a00016b).
> I attached the data set and the cache file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
