impala-user mailing list archives

From Bruno Quinart <bquin...@icloud.com>
Subject Parquet min/max statistics & null values
Date Thu, 26 Oct 2017 13:02:16 GMT
Hi all

With IMPALA-2328, Parquet row group statistics are now used to skip a row group completely
if the predicate excludes its min/max range.
We have a use case in which we make sure the data is sorted on a 'key' and then run many
selective queries on that 'key' field. We notice a significant performance increase.
So thanks a lot for all the work on that!

One thing we notice is an unexpected behavior for records where that 'key' has null values.
It seems that as soon as null values are present in a row group, the test on the min/max fails
and the row group is read.
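
To make the expectation concrete, here is a minimal Python sketch of the skipping logic as I picture it. It is not Impala's actual implementation: the names RowGroupStats, compute_stats and can_skip_for_equality are made up for illustration, and it assumes that min/max are computed over the non-null values only, with nulls tracked in a separate count.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for one column."""
    min_value: Optional[int]  # None when the row group holds only nulls
    max_value: Optional[int]
    null_count: int


def compute_stats(values: List[Optional[int]]) -> RowGroupStats:
    # Assumption: nulls are excluded from min/max and only counted.
    non_null = [v for v in values if v is not None]
    return RowGroupStats(
        min_value=min(non_null) if non_null else None,
        max_value=max(non_null) if non_null else None,
        null_count=len(values) - len(non_null),
    )


def can_skip_for_equality(stats: RowGroupStats, wanted: int) -> bool:
    """True if no row in the group can satisfy `key = wanted`."""
    if stats.min_value is None or stats.max_value is None:
        # In this sketch, missing min/max means the group is all nulls,
        # and a null never satisfies an equality predicate.
        return True
    return wanted < stats.min_value or wanted > stats.max_value


# Data sorted on 'key', split into row groups, with some nulls mixed in.
row_groups = [[1, 2, 3], [4, None, 6], [7, 8, None], [None, None]]
skipped = [can_skip_for_equality(compute_stats(rg), 2) for rg in row_groups]
print(skipped)
```

Under these assumptions only the first row group needs to be read for key = 2; the nulls in the other groups do not change the outcome, which is what we expected to see.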

We work with Impala 2.9. The data is written to Parquet files by Impala itself. We have noticed
this effect for both bigint and decimal fields. Note that it's difficult for me to extract
the min/max statistics from the Parquet files. The parquet-tools included in our distribution
(5.12) is not the latest. And I was told that, because of PARQUET-327, it would not print those row
group stats anyway because of the way Impala stores them.
We do confirm the expected behavior (exactly one row group read for properly sorted data)
when we create a similar table but explicitly filter out all null values for that 'key' field.
We also notice that the number of row groups read (but with zero records retained) is proportional
to the number of null values.

Is this behavior expected?
Is there a fundamental reason those row groups cannot be skipped?

Thanks!
Bruno

