impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bruno Quinart <bquin...@icloud.com>
Subject Re: Parquet min/max statistics & null values
Date Sat, 28 Oct 2017 20:19:19 GMT
Hi

Thanks for your replies.

An example query:
SELECT * FROM reasonably_sized_partitioned_table
WHERE partition_key_dtnr = 20171028
AND sorted_column = 56789123456
(same effect with an IN instead of the last equality)

I checked the behavior in our cluster again and can now confirm Lars explanation:
- Filtering on stats works well as from 2.9, except when there is only null values for the
column in a parquet file
- And that would be addressed by IMPALA-6113
I was wrongly assuming that in some of my tables there would not have been enough null values
to fill a parquet file for each partition.

I wonder how you verify those stats in Parquet metadata. Parquet-tools currently does not
print them.

Thanks for the help
Bruno

On 27 Oct, 2017, at 12:54 AM, Lars Volker wrote:

Hi Bruno,

To clarify on your original observation: Column chunks that only have null
values in them will not have their min_value and max_value fields populated
and thus won't be skipped based on stats. I filed IMPALA-6113
<https://issues.apache.org/jira/browse/IMPALA-6113> to track this.
IMPALA-5061 <https://issues.apache.org/jira/browse/IMPALA-5061> added
support to populate the null_count in statistics, allowing us to detect
column chunks that only contain NULLs. We should use that information to
skip row groups if the predicate allows us to.

Row groups with column chunks that have at least one non-null value should
get filtered correctly.

Cheers, Lars

On Thu, Oct 26, 2017 at 10:02 AM, Tim Armstrong wrote:

Hi Bruno,
Could you provide an example of the specific predicates that aren't being
used to successfully skip the row group?

- Tim

On Thu, Oct 26, 2017 at 7:21 AM, Jeszy wrote:

Hello Bruno,

Thanks for bringing this up. While not apparent from the commit
comments, this limitation was mentioned during the code review:
'min/max are only set when there are non-null values, so we don't
consider statistics for "is null".' (see
https://gerrit.cloudera.org/#/c/6147/).
It looks to me that this was intended, but I'll let others confirm.
Definitely a point where we can improve.

Thanks!

On 26 October 2017 at 08:02, Bruno Quinart wrote:
> Hi all
>
> With IMPALA-2328, Parquet row group statistics are now being used to
skip
> the row group completely if the min/max range is excluded from the
> predicate.
> We have a use case in which we make sure the data is sorted on a 'key'
and
> have then many selective queries on that 'key' field. We notice a
> significant performance increase.
> So thanks a lot for all the work on that!
>
> One thing we notice is an unexpected behavior for records where that
'key'
> has null values. It seems that as soon as null values are present in a
row
> group, the test on the min/max fails and the row group is read.
>
> We work with Impala 2.9. The data is put in parquet files by Impala
itself.
> We have noticed this effect for both bigint as decimal fields. Note that
> it's difficult for me to extract the min/max statistics from the parquet
> files. The parquet-tools included in our distribution (5.12) is not the
> latest. And I was told PARQUET-327 would anyway not print the those row
> group stats because of the way Impala stores them.
> We do confirm the expected behavior (exactly one row group read for
properly
> sorted data) when we create a similar table but explicitly filter out
all
> null values for that 'key' field. We also notice that the the number of
row
> groups read (but zero records retained) is proportional to the number of
> null values.
>
> Is this behavior expected?
> Is there a fundamental reason those row groups can not be skipped?
>
> Thanks!
> Bruno
>




Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
    • Unnamed multipart/related (inline, None, 0 bytes)
View raw message