impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Armstrong <tarmstr...@cloudera.com>
Subject Re: Parquet min/max statistics & null values
Date Thu, 26 Oct 2017 17:02:02 GMT
Hi Bruno,
 Could you provide an example of the specific predicates that aren't being
used to successfully skip the row group?

- Tim

On Thu, Oct 26, 2017 at 7:21 AM, Jeszy <jeszyb@gmail.com> wrote:

> Hello Bruno,
>
> Thanks for bringing this up. While not apparent from the commit
> comments, this limitation was mentioned during the code review:
> 'min/max are only set when there are non-null values, so we don't
> consider statistics for "is null".' (see
> https://gerrit.cloudera.org/#/c/6147/).
> It looks to me that this was intended, but I'll let others confirm.
> Definitely a point where we can improve.
>
> Thanks!
>
> On 26 October 2017 at 08:02, Bruno Quinart <bquinart@icloud.com> wrote:
> > Hi all
> >
> > With IMPALA-2328, Parquet row group statistics are now being used to skip
> > the row group completely if the min/max range is excluded from the
> > predicate.
> > We have a use case in which we make sure the data is sorted on a 'key'
> and
> > have then many selective queries on that 'key' field. We notice a
> > significant performance increase.
> > So thanks a lot for all the work on that!
> >
> > One thing we notice is an unexpected behavior for records where that
> 'key'
> > has null values. It seems that as soon as null values are present in a
> row
> > group, the test on the min/max fails and the row group is read.
> >
> > We work with Impala 2.9. The data is put in parquet files by Impala
> itself.
> > We have noticed this effect for both bigint as decimal fields. Note that
> > it's difficult for me to extract the min/max statistics from the parquet
> > files. The parquet-tools included in our distribution (5.12) is not the
> > latest. And I was told PARQUET-327 would anyway not print the those row
> > group stats because of the way Impala stores them.
> > We do confirm the expected behavior (exactly one row group read for
> properly
> > sorted data) when we create a similar table but explicitly filter out all
> > null values for that 'key' field. We also notice that the the number of
> row
> > groups read (but zero records retained) is proportional to the number of
> > null values.
> >
> > Is this behavior expected?
> > Is there a fundamental reason those row groups can not be skipped?
> >
> > Thanks!
> > Bruno
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message