impala-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alexander Behm (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (IMPALA-4174) Planner incorrectly estimates cardinality for many to many joins
Date Mon, 03 Jul 2017 00:34:00 GMT

     [ https://issues.apache.org/jira/browse/IMPALA-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Alexander Behm resolved IMPALA-4174.
------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.10.0

commit 9f678a74269250bf5c7ae2c5e8afd93c5b3734de
Author: Alex Behm <alex.behm@cloudera.com>
Date:   Tue Jun 6 16:54:41 2017 -0700

    IMPALA-5547: Rework FK/PK join detection.
    
    Reworks the FK/PK join detection logic to:
    - more accurately recognize many-to-many joins
    - avoid dim/dim joins for multi-column PKs
    
    The new detection logic maintains our existing philosophy of generally
    assuming a FK/PK join, unless there is strong evidence to the
    contrary, as follows.
    
    For each set of simple equi-join conjuncts between two tables, we
    compute the joint NDV of the right-hand side columns by
    multiplication, and if the joint NDV is significantly smaller than
    the right-hand side row count, then we are fairly confident that the
    right-hand side is not a PK. Otherwise, we assume the set of conjuncts
    could represent a FK/PK relationship.
    
    Extends the explain plan to include the outcome of the FK/PK detection
    at EXPLAIN_LEVEL > STANDARD.
    
    Performance testing:
    1. Full TPC-DS run on 10TB:
       - Q10 improved by >100x
       - Q72 improved by >25x
       - Q17,Q26,Q29 improved by 2x
       - Q64 regressed by 10x
       - Total runtime: Improved by 2x
       - Geomean: Minor improvement
       The regression of Q64 is understood and we will try to address it
       in follow-on changes. The previous plan was better by accident and
       not because of superior logic.
    2. Nightly TPC-H and TPC-DS runs:
       - No perf differences
    
    Testing:
    - The existing planner test cover the changes.
    - Code/hdfs run passed.
    
    Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
    Reviewed-on: http://gerrit.cloudera.org:8080/7257
    Reviewed-by: Alex Behm <alex.behm@cloudera.com>
    Tested-by: Impala Public Jenkins


> Planner incorrectly estimates cardinality for many to many joins
> ----------------------------------------------------------------
>
>                 Key: IMPALA-4174
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4174
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Frontend
>    Affects Versions: Impala 2.5.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Alexander Behm
>              Labels: planner, tpc-ds
>             Fix For: Impala 2.10.0
>
>
> The planner incorrectly estimates the cardinality for many to many joins, this can results
in inefficient join ordering when many to many relations exist in a query. 
> {code}
> show column stats nation;
> {code}
> | Column     || Type   ||#Distinct Values|| #Nulls ||Max Size|| Avg Size          |
> | n_nationkey | BIGINT | 25               | -1     | 8        | 8                 |
> | n_name      | STRING | 25               | -1     | 14       | 7.079999923706055 |
> | n_regionkey | BIGINT | 5                | -1     | 8        | 8                 |
> | n_comment   | STRING | 25               | -1     | 114      | 74.27999877929688 |
> | #Rows ||#Files|| Size|| Bytes Cached || Cache Replication || Format  || Incremental
stats || Location|
> | 25    | 1      | 2.19KB | NOT CACHED   | NOT CACHED        | PARQUET | false      
      | hdfs://d2412.halxg.cloudera.com:8020/user/hive/warehouse/tpch_3000_parquet.db/nation
|
> Plan
> {code}
> explain select count(*) from nation n1, nation n2 where n1.n_regionkey = n2.n_regionkey;
> +----------------------------------------------------------+
> | Explain String                                           |
> +----------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=26.00MB VCores=2 |
> |                                                          |
> | 06:AGGREGATE [FINALIZE]                                  |
> | |  output: count:merge(*)                                |
> | |  hosts=1 per-host-mem=unavailable                      |
> | |  tuple-ids=2 row-size=8B cardinality=1                 |
> | |                                                        |
> | 05:EXCHANGE [UNPARTITIONED]                              |
> | |  hosts=1 per-host-mem=unavailable                      |
> | |  tuple-ids=2 row-size=8B cardinality=1                 |
> | |                                                        |
> | 03:AGGREGATE                                             |
> | |  output: count(*)                                      |
> | |  hosts=1 per-host-mem=10.00MB                          |
> | |  tuple-ids=2 row-size=8B cardinality=1                 |
> | |                                                        |
> | 02:HASH JOIN [INNER JOIN, BROADCAST]                     |
> | |  hash predicates: n1.n_regionkey = n2.n_regionkey      |
> | |  hosts=1 per-host-mem=221B                             |
> | |  tuple-ids=0,1 row-size=16B cardinality=25             |
> | |                                                        |
> | |--04:EXCHANGE [BROADCAST]                               |
> | |  |  hosts=1 per-host-mem=0B                            |
> | |  |  tuple-ids=1 row-size=8B cardinality=25             |
> | |  |                                                     |
> | |  01:SCAN HDFS [tpch_3000_parquet.nation n2, RANDOM]    |
> | |     partitions=1/1 files=1 size=2.19KB                 |
> | |     table stats: 25 rows total                         |
> | |     column stats: all                                  |
> | |     hosts=1 per-host-mem=16.00MB                       |
> | |     tuple-ids=1 row-size=8B cardinality=25             |
> | |                                                        |
> | 00:SCAN HDFS [tpch_3000_parquet.nation n1, RANDOM]       |
> |    partitions=1/1 files=1 size=2.19KB                    |
> |    table stats: 25 rows total                            |
> |    column stats: all                                     |
> |    hosts=1 per-host-mem=16.00MB                          |
> |    tuple-ids=0 row-size=8B cardinality=25                |
> +----------------------------------------------------------+
> {code}
> Note that the estimate cardinality for the JOIN is 25 where it should be 125
> | Operator        || #Hosts || Avg Time || Max Time || #Rows || Est. #Rows || Peak Mem
|| Est. Peak Mem || Detail                      |
> | 06:AGGREGATE    | 1      | 76.13ms  | 76.13ms  | 1     | 1          | 16.00 KB | -1
B          | FINALIZE                    |
> | 05:EXCHANGE     | 1      | 32.51us  | 32.51us  | 1     | 1          | 0 B      | -1
B          | UNPARTITIONED               |
> | 03:AGGREGATE    | 1      | 44.39ms  | 44.39ms  | 1     | 1          | 44.00 KB | 10.00
MB      |                             |
> | 02:HASH JOIN    | 1      | 6.77ms   | 6.77ms   | 125   | 25         | 2.10 MB  | 221
B         | INNER JOIN, BROADCAST       |
> | I--04:EXCHANGE  | 1      | 9.12us   | 9.12us   | 25    | 25         | 0 B      | 0
B           | BROADCAST                   |
> | I  01:SCAN HDFS | 1      | 51.48ms  | 51.48ms  | 25    | 25         | 53.00 KB | 16.00
MB      | tpch_3000_parquet.nation n2 |
> | 00:SCAN HDFS    | 1      | 3.25ms   | 3.25ms   | 25    | 25         | 60.00 KB | 16.00
MB      | tpch_3000_parquet.nation n1 |



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message