Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id C8FE5200D29 for ; Thu, 26 Oct 2017 15:02:30 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id C77E6160BF2; Thu, 26 Oct 2017 13:02:30 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id E65451609E8 for ; Thu, 26 Oct 2017 15:02:29 +0200 (CEST) Received: (qmail 23911 invoked by uid 500); 26 Oct 2017 13:02:29 -0000 Mailing-List: contact user-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@impala.incubator.apache.org Delivered-To: mailing list user@impala.incubator.apache.org Received: (qmail 23901 invoked by uid 99); 26 Oct 2017 13:02:28 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 26 Oct 2017 13:02:28 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 27BDE1807D0 for ; Thu, 26 Oct 2017 13:02:28 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -1.6 X-Spam-Level: X-Spam-Status: No, score=-1.6 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, MIME_QP_LONG_LINE=0.001, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H2=-2.8, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=icloud.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id 4rVLRGdKLvjY for ; Thu, 26 Oct 2017 13:02:27 +0000 (UTC) Received: from pv35p12im-ztdg05041101.me.com (pv35p12im-ztdg05041101.me.com [17.133.186.214]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id C1CB95FD41 for ; Thu, 26 Oct 2017 13:02:26 +0000 (UTC) Received: from process-dkim-sign-daemon.pv35p12im-ztdg05041101.me.com by pv35p12im-ztdg05041101.me.com (Oracle Communications Messaging Server 8.0.1.2.20170607 64bit (built Jun 7 2017)) id <0OYF00J00LD6MM00@pv35p12im-ztdg05041101.me.com> for user@impala.apache.org; Thu, 26 Oct 2017 13:02:19 +0000 (GMT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=icloud.com; s=04042017; t=1509022939; bh=XbzKbniSUMdK2Wozvf5AA6FtUpxjKYYrReii2GeHFYY=; h=To:From:Subject:Date:Message-id:Content-type:MIME-version; b=HqWOgxQBqU526ftRB7c/Ocigh+fqGKOA+Msx1Xowd2vAGNKpwI4wbicpjAdLM0l5E Pqal94nJXM4mlpdoJIAv44gCptXChqhjKRDoTekFnZ3ZB3EvRTFqOlxdvVYRpdV2Mo eB7nycCg6+q/KmJdKchvjSM53fZiRhrpPhk8swvjMVVSN4LJN2FfCCKsSSPbbrcGDr pAGwdn32ydJG0xy+e1emgi6v/ih+lAnaI2kMzl6l/2yn/PFHmXEFXdGjhGk0rJ2qnG 6fpQ9rht1O6PtZMQYLW0kCsgRMKBUCunsOKjde8/VMz3fbwszxUwehnuJwojvbUNlJ ROlAKnbJgIwAQ== Received: from localhost (pv38p12im-ztdg02083501.me.com [17.133.131.38]) by pv35p12im-ztdg05041101.me.com (Oracle Communications Messaging Server 8.0.1.2.20170607 64bit (built Jun 7 2017)) with ESMTP id <0OYF00OGQLJT8G10@pv35p12im-ztdg05041101.me.com> for user@impala.apache.org; Thu, 26 Oct 2017 13:02:17 +0000 (GMT) X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-10-26_03:,, signatures=0 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 clxscore=1011 suspectscore=1 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1707230000 definitions=main-1710260180 To: user@impala.apache.org From: Bruno Quinart Subject: Parquet min/max statistics & null values Date: Thu, 26 Oct 2017 13:02:16 +0000 (GMT) X-Mailer: iCloud MailClient17GProject56 MailServer17G81.26636-16A-2490-810ed00eba53 Message-id: <206246c5-1338-458e-b28d-736fce6ab738@me.com> Content-type: multipart/alternative; boundary=Apple-Webmail-42--de6ac958-0a18-4bf5-9902-e6d280bd3050 MIME-version: 1.0 archived-at: Thu, 26 Oct 2017 13:02:31 -0000 --Apple-Webmail-42--de6ac958-0a18-4bf5-9902-e6d280bd3050 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8; format=flowed Hi all=0A=0AWith IMPALA-2328, Parquet row group statistics are now being u= sed to skip the row group completely if the min/max range is excluded from= the predicate.=0AWe have a use case in which we make sure the data is sor= ted on a 'key' and have then many selective queries on that 'key' field. W= e notice a significant performance increase.=0ASo thanks a lot for all the= work on that!=0A=0AOne thing we notice is an unexpected behavior for reco= rds where that 'key' has null values. It seems that as soon as null values= are present in a row group, the test on the min/max fails and the row gro= up is read.=0A=0AWe work with Impala 2.9. The data is put in parquet files= by Impala itself. We have noticed this effect for both bigint as decimal = fields.=C2=A0Note that it's difficult for me to extract the min/max statis= tics from the parquet files. The parquet-tools included in our distributio= n (5.12) is not the latest. And I was told PARQUET-327 would anyway not pr= int the those row group stats because of the way Impala stores them.=0AWe = do confirm the expected behavior (exactly one row group read for properly = sorted data) when we create a similar table but explicitly filter out all = null values for that 'key' field. We also notice that the the number of ro= w groups read (but zero records retained) is proportional to the number of= null values.=0A=0AIs this behavior expected?=0AIs there a fundamental rea= son those row groups can not be skipped?=0A=0AThanks!=0ABruno=0A=0A --Apple-Webmail-42--de6ac958-0a18-4bf5-9902-e6d280bd3050 Content-Type: multipart/related; type="text/html"; boundary=Apple-Webmail-86--de6ac958-0a18-4bf5-9902-e6d280bd3050 --Apple-Webmail-86--de6ac958-0a18-4bf5-9902-e6d280bd3050 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8;
Hi all

With IMPALA-2328= , Parquet row group statistics are now being used to skip the row group co= mpletely if the min/max range is excluded from the predicate.
We= have a use case in which we make sure the data is sorted on a 'key' and h= ave then many selective queries on that 'key' field. We notice a significa= nt performance increase.
So thanks a lot for all the work on tha= t!

One thing we notice is an= unexpected behavior for records where that 'key' has null values. It seem= s that as soon as null values are present in a row group, the test on the = min/max fails and the row group is read.

We work with Impala 2.9. The data is put in parquet files by = Impala itself. We have noticed this effect for both bigint as decimal fiel= ds. Note that it's difficult for me to extract t= he min/max statistics from the parquet files. The parquet-tools included i= n our distribution (5.12) is not the latest. And I was told PARQUET-327 wo= uld anyway not print the those row group stats because of the way Impala s= tores them.
We do confirm the expected behavior (exactly = one row group read for properly sorted data) when we create a similar tabl= e but explicitly filter out all null values for that 'key' field. We also = notice that the the number of row groups read (but zero records retained) = is proportional to the number of null values.

Is this behavior expected?
Is there a fund= amental reason those row groups can not be skipped?

Thanks!
Bruno

--Apple-Webmail-86--de6ac958-0a18-4bf5-9902-e6d280bd3050-- --Apple-Webmail-42--de6ac958-0a18-4bf5-9902-e6d280bd3050--