impala-reviews mailing list archives

From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Fri, 20 Jan 2017 17:15:08 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 1:

> > That is really unfortunate that our timestamps are treated as byte
 > > arrays by parquet-mr - it makes the min/max stats mostly useless
 > > for pruning files. I feel like this is a bug in parquet-mr, since
 > > INT96 is in the spec (https://github.com/apache/parquet-format/blob/98c5e2b8575a809b09d996080428be730614d374/Encodings.md)
 > > and it's being treated inconsistently with int32/int64. Common
 > > sense would dictate that min/max of int96 should be treated the
 > > same as int32/int64. Seems like something we should open an issue
 > > against Parquet for? And Hive? Otherwise our timestamp stats will
 > > be pretty useless. In any case we should clarify this before
 > > writing out our own incompatible stats.
 > 
 > I agree, in fact this may actually be two separate bugs.
 > 
 > 1) parquet-mr uses Binary internally to store INT96, and will use
 > BinaryStatistics for those values (https://github.com/Parquet/parquet-mr/blob/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-column/src/main/java/parquet/column/statistics/Statistics.java#L61).
 > 2) Hive hands Timestamps over to parquet-mr as BINARY, too, instead
 > of using INT96. Currently this makes no difference, but once
 > statistics support for INT96 is fixed in parquet-mr, Hive
 > will need to catch up.
 > 
 > @Zoltan, should I go ahead and open one issue with each of them to
 > sort this out?

Our read path will need some logic to detect the corrupt statistics written by parquet-mr
1.5, so that we can filter them out. In the same code path we could filter out all timestamp
statistics written by Hive until the ordering gets fixed.

However, the statistics we write would be incompatible with Hive. Older versions of Hive will
be unable to detect that the semantics have changed from little-endian binary ordering to
numeric ordering, so I currently don't see an alternative to encoding them as 12-byte
little-endian binaries and then ordering them bytewise.
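
To illustrate why the two orderings are incompatible, here is a minimal sketch in Python. It is
not Impala or parquet-mr code: the helper names are made up for this example, plain non-negative
integers stand in for the 12-byte values, and the bytewise comparison is assumed to be unsigned
and lexicographic (which is how Python compares bytes objects).

```python
def encode_le12(value: int) -> bytes:
    """Encode a non-negative integer as a 12-byte little-endian binary (hypothetical helper)."""
    return value.to_bytes(12, byteorder="little", signed=False)


def numeric_min_max(values):
    """Min/max under numeric ordering."""
    return min(values), max(values)


def bytewise_min_max(values):
    """Min/max under bytewise ordering of the little-endian encoding.

    Python compares bytes objects lexicographically, treating each byte as unsigned.
    """
    encoded = [encode_le12(v) for v in values]
    return min(encoded), max(encoded)


values = [1, 255, 256]

num_min, num_max = numeric_min_max(values)
bin_min, bin_max = bytewise_min_max(values)

# Decode the bytewise min/max back to integers to compare with the numeric result.
decoded_min = int.from_bytes(bin_min, byteorder="little")
decoded_max = int.from_bytes(bin_max, byteorder="little")

print("numeric  min/max:", num_min, num_max)
print("bytewise min/max:", decoded_min, decoded_max)
```

With these inputs the two orderings pick different min and max values, so a reader that assumes
one ordering cannot safely prune with statistics written under the other.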

-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi+gerrit@cloudera.com>
Gerrit-HasComments: No
