From issues-return-138549-archive-asf-public=cust-asf.ponee.io@hive.apache.org Wed Oct 3 16:10:06 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id 11AD818065B for ; Wed, 3 Oct 2018 16:10:05 +0200 (CEST) Received: (qmail 94694 invoked by uid 500); 3 Oct 2018 14:10:05 -0000 Mailing-List: contact issues-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list issues@hive.apache.org Received: (qmail 94679 invoked by uid 99); 3 Oct 2018 14:10:05 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 03 Oct 2018 14:10:05 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id CA9841A0420 for ; Wed, 3 Oct 2018 14:10:04 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -107.5 X-Spam-Level: X-Spam-Status: No, score=-107.5 tagged_above=-999 required=6.31 tests=[ENV_AND_HDR_SPF_MATCH=-0.5, KAM_ASCII_DIVIDERS=0.8, KAM_BADIPHTTP=2, NORMAL_HTTP_TO_IP=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, USER_IN_DEF_SPF_WL=-7.5, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id 710-WwK1SlzV for ; Wed, 3 Oct 2018 14:10:02 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id F2EDE5F6BB for ; Wed, 3 Oct 2018 14:10:01 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 011ACE262A for ; Wed, 3 Oct 2018 14:10:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id B236F2552A for ; Wed, 3 Oct 2018 14:10:00 +0000 (UTC) Date: Wed, 3 Oct 2018 14:10:00 +0000 (UTC) From: "Hive QA (JIRA)" To: issues@hive.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637008#comment-16637008 ] Hive QA commented on HIVE-20660: -------------------------------- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 12s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 3m 59s{color} | {color:blue} ql in master has 2321 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 36s{color} | {color:red} ql: The patch generated 6 new + 38 unchanged - 0 fixed = 44 total (was 38) {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch 1 line(s) with tabs. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 4m 3s{color} | {color:red} ql generated 1 new + 2321 unchanged - 0 fixed = 2322 total (was 2321) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 13s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 23m 20s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:ql | | | Comparison of String objects using == or != in org.apache.hadoop.hive.ql.exec.OperatorUtils.areBacktrackedExprsCompatible(List, List) At OperatorUtils.java:== or != in org.apache.hadoop.hive.ql.exec.OperatorUtils.areBacktrackedExprsCompatible(List, List) At OperatorUtils.java:[line 536] | \\ \\ || Subsystem || Report/Notes || | Optional Tests | asflicense javac javadoc findbugs checkstyle compile | | uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux | | Build tool | maven | | Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-14202/dev-support/hive-personality.sh | | git revision | master / a06a370 | | Default Java | 1.8.0_111 | | findbugs | v3.0.0 | | checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-14202/yetus/diff-checkstyle-ql.txt | | whitespace | http://104.198.109.242/logs//PreCommit-HIVE-Build-14202/yetus/whitespace-tabs.txt | | findbugs | http://104.198.109.242/logs//PreCommit-HIVE-Build-14202/yetus/new-findbugs-ql.html | | modules | C: ql U: ql | | Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-14202/yetus.txt | | Powered by | Apache Yetus http://yetus.apache.org | This message was automatically generated. > Group by statistics estimation could be improved by bounding the total number of rows to source table > ----------------------------------------------------------------------------------------------------- > > Key: HIVE-20660 > URL: https://issues.apache.org/jira/browse/HIVE-20660 > Project: Hive > Issue Type: Improvement > Components: Statistics > Affects Versions: 4.0.0 > Reporter: Vineet Garg > Assignee: Vineet Garg > Priority: Major > Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch > > > Currently the stats for group by is estimated by taking product of NDVs of all the keys and bounding it by the number of rows of its input. This bound could be improved by using the source table instead of immediate input, the insight in this case is that cardinality/ndvs of a table can not go beyond the original (outer joins will only add NULLs thereby increasing the cardinality by 1). > Note that the assumption here is that group by keys all belong to the same source table/input. > This will improve the estimation in situations where group by is executed after joins wherein Hive could end up estimating the number of rows. > *Reproducer* > {code:sql} > set hive.stats.fetch.column.stats=true; > create table t1(i int, j int); > alter table t1 update statistics set('numRows'='10000', 'rawDataSize'='18000'); > alter table t1 update statistics for column i set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0'); > alter table t1 update statistics for column j set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50'); > create table t2(i2 int, j2 int); > alter table t2 update statistics set('numRows'='100000000', 'rawDataSize'='10000'); > alter table t2 update statistics for column i2 set('numDVs'='10000000','numNulls'='0','highValue'='8000','lowValue'='0'); > alter table t2 update statistics for column j2 set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1'); > explain select count (1) from t1,t2 > where t1.j=t2.i2 > group by t1.i, t1.j; > {code} > {code:sql} > Reducer 2 > Reduce Operator Tree: > Merge Join Operator > condition map: > Inner Join 0 to 1 > keys: > 0 _col1 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 99700 Data size: 797288 Basic stats: COMPLETE Column stats: COMPLETE > Group By Operator > aggregations: count() > keys: _col0 (type: int), _col1 (type: int) > mode: hash > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE <========== > Reduce Output Operator > key expressions: _col0 (type: int), _col1 (type: int) > sort order: ++ > Map-reduce partition columns: _col0 (type: int), _col1 (type: int) > Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col2 (type: bigint) > ..................... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)