impala-reviews mailing list archives

From "Zoltan Ivanfi (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Fri, 20 Jan 2017 15:30:12 GMT
Zoltan Ivanfi has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer

Patch Set 2:

> > (1 comment)
 > Apologies for the delayed reply. Hive writes timestamps using 12
 > bytes using little endian. Then it passes them to parquet-mr as a
 > BINARY string, which means it is hitting PARQUET-251. This explains
 > why I saw the odd values for min/max in my tests.
 > Internally parquet-mr orders BINARY values using byte comparison,
 > potentially leading to a min/max value not being the semantically
 > smallest/largest value of a set of values. I am inclined to call
 > this a bug in hive, but I'm curious to hear what you think about
 > this.

I don't think it's a bug that the min/max corresponds to the binary ordering, since at Parquet's
level timestamps are just meaningless bytes. If we were using a proper Parquet logical type
then it would be different, but when saving 12 bytes, I think the proper order is the binary
ordering. In any case, I think we should aim for Hive-compatibility in this.
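To illustrate how binary ordering can diverge from chronological ordering, here is a minimal sketch. It assumes the 12 bytes are laid out as an 8-byte nanoseconds-of-day field followed by a 4-byte Julian day, both little endian, and it uses Python's unsigned lexicographic bytes comparison as a stand-in for the byte comparison; the exact comparator in parquet-mr may differ. The helper names and the sample values are hypothetical.

```python
import struct


def encode_ts(julian_day: int, nanos_of_day: int) -> bytes:
    # Assumed layout: 8-byte nanoseconds-of-day, then 4-byte Julian day,
    # both little endian, 12 bytes in total.
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_ts(raw: bytes) -> tuple:
    # Returns (julian_day, nanos_of_day), which sorts chronologically as a tuple.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


# Hypothetical sample values, not taken from any real file.
values = [
    encode_ts(2457774, 1),
    encode_ts(2457773, 256),
    encode_ts(2457773, 2),
]

# Ordering as opaque bytes (unsigned, lexicographic).
byte_min, byte_max = min(values), max(values)

# Ordering by the decoded, semantic value.
sem_min, sem_max = min(values, key=decode_ts), max(values, key=decode_ts)

print("byte order min/max:    ", decode_ts(byte_min), decode_ts(byte_max))
print("semantic order min/max:", decode_ts(sem_min), decode_ts(sem_max))
```

Because the least significant byte comes first in a little-endian encoding, the byte-wise minimum here is the value with 256 nanoseconds, while the chronologically smallest value is the one with 2 nanoseconds on the same day.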

The bug that causes the last row to be both the min and the max value is a major pain, though:
it makes column statistics for byte arrays totally useless. I don't see how we could
handle that other than by ignoring any such min/max values written by affected Hive versions.
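A reader-side guard along those lines might look like the sketch below. It is only an illustration under several assumptions: that the writer can be identified from the file footer's created_by string, that parquet-mr writes it in the form "parquet-mr version X.Y.Z ...", and that 1.8.0 is the first release with the PARQUET-251 fix. The function name, the parameter names, and the default threshold are hypothetical, and mapping Hive versions to the parquet-mr version they bundle is left out.

```python
import re

# Physical types whose statistics are assumed to be affected by the bug.
BYTE_ARRAY_TYPES = {"BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"}


def should_ignore_min_max(created_by, physical_type, fixed_version=(1, 8, 0)):
    """Return True if min/max stats for this column should not be trusted.

    fixed_version is an assumption: the first parquet-mr release taken to
    contain the fix. Adjust it to whatever the affected range turns out to be.
    """
    if physical_type not in BYTE_ARRAY_TYPES:
        return False
    created_by = created_by or ""
    if not created_by.startswith("parquet-mr"):
        # Files from other writers are not covered by this check.
        return False
    match = re.match(r"parquet-mr version (\d+)\.(\d+)\.(\d+)", created_by)
    if match is None:
        # parquet-mr file with an unparsable version: be conservative.
        return True
    version = tuple(int(part) for part in match.groups())
    return version < fixed_version
```

With such a check, the reader would skip the stored min/max for byte array columns from affected files and fall back to scanning, while statistics for other types and other writers stay usable.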


Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-Reviewer: Zoltan Ivanfi <>
Gerrit-HasComments: No
