impala-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alexander Behm (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (IMPALA-4695) Planner incorrectly estimates cardinality for multi column joins
Date Mon, 03 Jul 2017 00:33:01 GMT

     [ https://issues.apache.org/jira/browse/IMPALA-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Alexander Behm resolved IMPALA-4695.
------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.10.0

commit 9f678a74269250bf5c7ae2c5e8afd93c5b3734de
Author: Alex Behm <alex.behm@cloudera.com>
Date:   Tue Jun 6 16:54:41 2017 -0700

    IMPALA-5547: Rework FK/PK join detection.
    
    Reworks the FK/PK join detection logic to:
    - more accurately recognize many-to-many joins
    - avoid dim/dim joins for multi-column PKs
    
    The new detection logic maintains our existing philosophy of generally
    assuming a FK/PK join, unless there is strong evidence to the
    contrary, as follows.
    
    For each set of simple equi-join conjuncts between two tables, we
    compute the joint NDV of the right-hand side columns by
    multiplication, and if the joint NDV is significantly smaller than
    the right-hand side row count, then we are fairly confident that the
    right-hand side is not a PK. Otherwise, we assume the set of conjuncts
    could represent a FK/PK relationship.
    
    Extends the explain plan to include the outcome of the FK/PK detection
    at EXPLAIN_LEVEL > STANDARD.
    
    Performance testing:
    1. Full TPC-DS run on 10TB:
       - Q10 improved by >100x
       - Q72 improved by >25x
       - Q17,Q26,Q29 improved by 2x
       - Q64 regressed by 10x
       - Total runtime: Improved by 2x
       - Geomean: Minor improvement
       The regression of Q64 is understood and we will try to address it
       in follow-on changes. The previous plan was better by accident and
       not because of superior logic.
    2. Nightly TPC-H and TPC-DS runs:
       - No perf differences
    
    Testing:
    - The existing planner test cover the changes.
    - Code/hdfs run passed.
    
    Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
    Reviewed-on: http://gerrit.cloudera.org:8080/7257
    Reviewed-by: Alex Behm <alex.behm@cloudera.com>
    Tested-by: Impala Public Jenkins


> Planner incorrectly estimates cardinality for multi column joins
> ----------------------------------------------------------------
>
>                 Key: IMPALA-4695
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4695
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Frontend
>    Affects Versions: Impala 2.8.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Alexander Behm
>              Labels: planner, tpc-ds
>             Fix For: Impala 2.10.0
>
>
> When working on IMPALA-4174 realized that the planner doesn't handle cardinality estimation
for multi column joins. 
> Another variation of the same problem is when the primary key is consistent of multiple
columns, for TPC-DS store_sales the primary key is (ss_ticket_number, ss_item_sk) and for
store_returns the primary key is (sr_ticket_number, sr_item_sk), so the join condition below
is a PK/PK join and the current algorithm doesn't handle that case and the estimate for the
join uses the many to many calculation which is incorrect 
> {code}
> SELECT count(*) 
> FROM   store_sales, 
>        store_returns 
> WHERE  ss_customer_sk = sr_customer_sk 
>        AND ss_item_sk = sr_item_sk 
>        AND ss_ticket_number = sr_ticket_number 
> {code}
> {code}
> +-------------------------------------------------------------------------------------------------------------------+
> | Explain String                                                                    
                               |
> +-------------------------------------------------------------------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=1.15GB VCores=2                           
                               |
> |                                                                                   
                               |
> | PLAN-ROOT SINK                                                                    
                               |
> | |                                                                                 
                               |
> | 07:AGGREGATE [FINALIZE]                                                           
                               |
> | |  output: count:merge(*)                                                         
                               |
> | |  hosts=20 per-host-mem=unavailable                                              
                               |
> | |  tuple-ids=2 row-size=8B cardinality=1                                          
                               |
> | |                                                                                 
                               |
> | 06:EXCHANGE [UNPARTITIONED]                                                       
                               |
> | |  hosts=20 per-host-mem=unavailable                                              
                               |
> | |  tuple-ids=2 row-size=8B cardinality=1                                          
                               |
> | |                                                                                 
                               |
> | 03:AGGREGATE                                                                      
                               |
> | |  output: count(*)                                                               
                               |
> | |  hosts=20 per-host-mem=10.00MB                                                  
                               |
> | |  tuple-ids=2 row-size=8B cardinality=1                                          
                               |
> | |                                                                                 
                               |
> | 02:HASH JOIN [INNER JOIN, PARTITIONED]                                            
                               |
> | |  hash predicates: ss_customer_sk = sr_customer_sk, ss_item_sk = sr_item_sk, ss_ticket_number
= sr_ticket_number |
> | |  runtime filters: RF000 <- sr_customer_sk, RF001 <- sr_item_sk, RF002 <-
sr_ticket_number                       |
> | |  hosts=20 per-host-mem=906.36MB                                                 
                               |
> | |  tuple-ids=0,1 row-size=40B cardinality=15477599674736                          
                               |
> | |                                                                                 
                               |
> | |--05:EXCHANGE [HASH(sr_customer_sk,sr_item_sk,sr_ticket_number)]                 
                               |
> | |  |  hosts=20 per-host-mem=0B                                                    
                               |
> | |  |  tuple-ids=1 row-size=20B cardinality=863989652                              
                               |
> | |  |                                                                              
                               |
> | |  01:SCAN HDFS [tpcds_3000_parquet.store_returns, RANDOM]                        
                               |
> | |     partitions=2004/2004 files=2008 size=48.92GB                                
                               |
> | |     table stats: 863989652 rows total                                           
                               |
> | |     column stats: all                                                           
                               |
> | |     hosts=20 per-host-mem=120.00MB                                              
                               |
> | |     tuple-ids=1 row-size=20B cardinality=863989652                              
                               |
> | |                                                                                 
                               |
> | 04:EXCHANGE [HASH(ss_customer_sk,ss_item_sk,ss_ticket_number)]                    
                               |
> | |  hosts=20 per-host-mem=0B                                                       
                               |
> | |  tuple-ids=0 row-size=20B cardinality=8639936081                                
                               |
> | |                                                                                 
                               |
> | 00:SCAN HDFS [tpcds_3000_parquet.store_sales, RANDOM]                             
                               |
> |    partitions=1824/1824 files=2649 size=376.48GB                                  
                               |
> |    runtime filters: RF000 -> ss_customer_sk, RF001 -> ss_item_sk, RF002 ->
ss_ticket_number                       |
> |    table stats: 8639936081 rows total                                             
                               |
> |    column stats: all                                                              
                               |
> |    hosts=20 per-host-mem=264.00MB                                                 
                               |
> |    tuple-ids=0 row-size=20B cardinality=8639936081                                
                               |
> +-------------------------------------------------------------------------------------------------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message