impala-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lars Volker ...@cloudera.com>
Subject Re: Parquet min/max statistics & null values
Date Thu, 26 Oct 2017 22:53:44 GMT
Hi Bruno,

To clarify on your original observation: Column chunks that only have null
values in them will not have their min_value and max_value fields populated
and thus won't be skipped based on stats. I filed IMPALA-6113
<https://issues.apache.org/jira/browse/IMPALA-6113> to track this.
IMPALA-5061 <https://issues.apache.org/jira/browse/IMPALA-5061> added
support to populate the null_count in statistics, allowing us to detect
column chunks that only contain NULLs. We should use that information to
skip row groups if the predicate allows us to.

Row groups with column chunks that have at least one non-null value should
get filtered correctly.

Cheers, Lars

On Thu, Oct 26, 2017 at 10:02 AM, Tim Armstrong <tarmstrong@cloudera.com>
wrote:

> Hi Bruno,
>  Could you provide an example of the specific predicates that aren't being
> used to successfully skip the row group?
>
> - Tim
>
> On Thu, Oct 26, 2017 at 7:21 AM, Jeszy <jeszyb@gmail.com> wrote:
>
>> Hello Bruno,
>>
>> Thanks for bringing this up. While not apparent from the commit
>> comments, this limitation was mentioned during the code review:
>> 'min/max are only set when there are non-null values, so we don't
>> consider statistics for "is null".' (see
>> https://gerrit.cloudera.org/#/c/6147/).
>> It looks to me that this was intended, but I'll let others confirm.
>> Definitely a point where we can improve.
>>
>> Thanks!
>>
>> On 26 October 2017 at 08:02, Bruno Quinart <bquinart@icloud.com> wrote:
>> > Hi all
>> >
>> > With IMPALA-2328, Parquet row group statistics are now being used to
>> skip
>> > the row group completely if the min/max range is excluded from the
>> > predicate.
>> > We have a use case in which we make sure the data is sorted on a 'key'
>> and
>> > have then many selective queries on that 'key' field. We notice a
>> > significant performance increase.
>> > So thanks a lot for all the work on that!
>> >
>> > One thing we notice is an unexpected behavior for records where that
>> 'key'
>> > has null values. It seems that as soon as null values are present in a
>> row
>> > group, the test on the min/max fails and the row group is read.
>> >
>> > We work with Impala 2.9. The data is put in parquet files by Impala
>> itself.
>> > We have noticed this effect for both bigint as decimal fields. Note that
>> > it's difficult for me to extract the min/max statistics from the parquet
>> > files. The parquet-tools included in our distribution (5.12) is not the
>> > latest. And I was told PARQUET-327 would anyway not print the those row
>> > group stats because of the way Impala stores them.
>> > We do confirm the expected behavior (exactly one row group read for
>> properly
>> > sorted data) when we create a similar table but explicitly filter out
>> all
>> > null values for that 'key' field. We also notice that the the number of
>> row
>> > groups read (but zero records retained) is proportional to the number of
>> > null values.
>> >
>> > Is this behavior expected?
>> > Is there a fundamental reason those row groups can not be skipped?
>> >
>> > Thanks!
>> > Bruno
>> >
>>
>
>

Mime
View raw message