Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8DBDC11785 for ; Fri, 29 Aug 2014 18:05:54 +0000 (UTC) Received: (qmail 98022 invoked by uid 500); 29 Aug 2014 18:05:53 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 97957 invoked by uid 500); 29 Aug 2014 18:05:53 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 97941 invoked by uid 500); 29 Aug 2014 18:05:53 -0000 Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org Received: (qmail 97938 invoked by uid 99); 29 Aug 2014 18:05:53 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Aug 2014 18:05:53 +0000 Date: Fri, 29 Aug 2014 18:05:53 +0000 (UTC) From: "Mostafa Mokhtar (JIRA)" To: hive-dev@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (HIVE-7913) Simplify predicates for CBO MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Mostafa Mokhtar created HIVE-7913: ------------------------------------- Summary: Simplify predicates for CBO Key: HIVE-7913 URL: https://issues.apache.org/jira/browse/HIVE-7913 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 0.13.1 Reporter: Mostafa Mokhtar Fix For: 0.14.0 I noticed that the estimate number of rows in Map joins is higher after the join than before the join that is with column stats fetch ON or OFF. TPC-DS Q55 was a good example for that, the issue is that the current statistics provide us enough information that we can estimate with strong confidence that the joins are one to many and not many to many. Joining store_sales x item on ss_item_sk = i_item_sk, we know that the NDV, min and max values for both join columns match while the row counts are different this pattern indicates a PK/FK relationship between store_sales and item. Yet when a filter is applied on item and reduces the number of rows from 462K to 7K we estimate a many to many join between the filtered item and store_sales and as a result the estimate number of rows coming out of the join is off by several orders of magnitude. Available information from the stats {code} Table Join column NDV from describe NDV actual min max item i_item_sk 439,501 462,000 1 462,000 date_dim d_date_sk 65,332 73,049 2,415,022 2,488,070 store_sales ss_item_sk 439,501 462,000 1 462,000 store_sales ss_sold_date_sk 2,226 1,823 2,450,816 2,452,642 {code} Same thing applies to store_sales and date_dim but with a caveat that the NDV , min and max values don't match where date_dim has a bigger domain and accordingly a higher NDV count. For joining store_sales and item on on ss_item_sk = i_item_sk since both columns have the same NDV, min and max values we can safely conclude that selectivity on item will translate to similar selectivity on store_sales. This is not the case for joining store_sales and date_dim on ss_sold_date_sk = d_date_sk since the domain of d_date_sk is much bigger than that of ss_sold_date_sk, differences in domain need to be taken into account when inferring selectivity onto store_sales. -- This message was sent by Atlassian JIRA (v6.2#6252)