impala-issues mailing list archives

From "Thomas Tauber-Marshall (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (IMPALA-5725) coalesce() not being fully applied with outer joins on kudu tables
Date Mon, 07 Aug 2017 22:11:00 GMT

     [ https://issues.apache.org/jira/browse/IMPALA-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Tauber-Marshall resolved IMPALA-5725.
--------------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.10.0

commit 2ae94e7ead090b3e80b7a75fee7026f2fe8d8ca9
Author: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Date:   Wed Aug 2 10:51:57 2017 -0700

    IMPALA-5725: coalesce() with outer join incorrectly rewritten
    
    A recent change, IMPALA-5016, added an expr rewrite rule to simplify
    coalesce(). This rule eliminates the coalesce() when its first
    parameter (that isn't constant null) is a SlotRef pointing to a
    SlotDescriptor that is non-nullable (for example because it is from
    a non-nullable Kudu column or because it is from an HDFS partition
    column with no null partitions), under the assumption that the SlotRef
    could never have a null value.
    
    This assumption is violated when the SlotRef is the output of an
    outer join, leading to incorrect results being returned. The problem
    is that the nullability of a SlotDescriptor (which determines whether
    there is a null indicator bit in the tuple for that slot) is a
    slightly different property than the nullability of a SlotRef pointing
    to that SlotDescriptor (since the SlotRef can still be NULL if the
    entire tuple is NULL).
    
    This patch removes the portion of the rewrite rule that considers
    the nullability of the SlotDescriptor. This means that we're missing
    out on some optimization opportunities, and we should revisit this in
    a way that works with outer joins (IMPALA-5753).
    
    Testing:
    - Updated FE tests.
    - Added regression tests to exprs.test
    
    Change-Id: I1ca6df949f9d416ab207016236dbcb5886295337
    Reviewed-on: http://gerrit.cloudera.org:8080/7567
    Reviewed-by: Matthew Jacobs <mj@cloudera.com>
    Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
    Tested-by: Impala Public Jenkins
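
The distinction the commit message draws between the nullability of a SlotDescriptor and the nullability of a SlotRef can be illustrated with a small model. This is only a sketch, not Impala's frontend code: the Python classes below are simplified stand-ins for the SlotRef and SlotDescriptor named above, and `faulty_rewrite`, `fixed_rewrite`, and `eval_coalesce` are hypothetical helpers introduced for illustration. The sketch also ignores the rule's handling of leading constant nulls and any other simplifications the rule still performs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SlotDescriptor:
    """Simplified stand-in: is_nullable models whether the tuple has a
    null indicator bit for this slot (e.g. a non-nullable Kudu column)."""
    name: str
    is_nullable: bool


@dataclass(frozen=True)
class SlotRef:
    """Simplified stand-in for an expression that reads one slot."""
    desc: SlotDescriptor


# A row maps slot names to values. An unmatched outer-join row is modeled by
# setting every slot of the NULL tuple to None, whatever is_nullable says.
Row = Dict[str, Optional[int]]


def eval_coalesce(args: List[SlotRef], row: Row) -> Optional[int]:
    """Return the first non-NULL argument value, or None if all are NULL."""
    for ref in args:
        value = row.get(ref.desc.name)
        if value is not None:
            return value
    return None


def faulty_rewrite(args: List[SlotRef]) -> List[SlotRef]:
    """Models the removed portion of the rule: if the first argument's
    SlotDescriptor is non-nullable, drop all remaining arguments."""
    if not args[0].desc.is_nullable:
        return [args[0]]
    return list(args)


def fixed_rewrite(args: List[SlotRef]) -> List[SlotRef]:
    """Models the patched behavior: descriptor nullability is not consulted."""
    return list(args)


n_nationkey = SlotRef(SlotDescriptor("n_nationkey", is_nullable=False))
p_size = SlotRef(SlotDescriptor("p_size", is_nullable=False))
coalesce_args = [n_nationkey, p_size]

# Row produced by a LEFT JOIN when no nation row matches p_size = 25.
unmatched_row: Row = {"n_nationkey": None, "p_size": 25}

print(eval_coalesce(faulty_rewrite(coalesce_args), unmatched_row))  # None
print(eval_coalesce(fixed_rewrite(coalesce_args), unmatched_row))   # 25
```

In this model the faulty rewrite reproduces the symptom reported in the issue: the slot is declared non-nullable, yet the reference evaluates to NULL because the whole tuple is NULL after the outer join.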

> coalesce() not being fully applied with outer joins on kudu tables
> ------------------------------------------------------------------
>
>                 Key: IMPALA-5725
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5725
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.10.0
>            Reporter: Michael Brown
>            Assignee: Thomas Tauber-Marshall
>            Priority: Blocker
>              Labels: correctness, kudu, query_generator, regression
>             Fix For: Impala 2.10.0
>
>         Attachments: explain-l2-2.10.txt, explain-l2-2.9.txt, profile-2.10.txt, profile-2.9.txt
>
>
> {{SELECT COALESCE()}} on multiple arguments spanning different tables with an {{OUTER JOIN}} on Kudu tables is not properly being applied. This behavior is
> # different relative to Kudu tables in 2.9
> # different relative to the 2.10 behavior with HDFS, seemingly making this Kudu-specific
> # different from Postgres, which matches the HDFS behavior, further making this seem Kudu-specific
> Consider this query:
> {noformat}
> USE tpch_kudu;
> SELECT
> COALESCE(a2.n_nationkey, a1.p_size),
> a2.n_nationkey,
> a1.p_size
> FROM part a1
> LEFT JOIN nation a2 ON (a1.p_size) = (a2.n_nationkey);
> {noformat}
> Some of the rows returned include:
> {noformat}
> +-------------------------------------+-------------+--------+
> | coalesce(a2.n_nationkey, a1.p_size) | n_nationkey | p_size |
> +-------------------------------------+-------------+--------+
> [snip]
> | 21                                  | 21          | 21     |
> | 22                                  | 22          | 22     |
> | 23                                  | 23          | 23     |
> | 24                                  | 24          | 24     |
> | NULL                                | NULL        | 25     |
> | NULL                                | NULL        | 26     |
> | NULL                                | NULL        | 27     |
> [snip]
> {noformat}
> The {{COALESCE()}} column is not returning the value of {{p_size}} when its first argument, {{n_nationkey}}, is {{NULL}}. {{tpch_kudu.nation n_nationkey}} has values between 0 and 24, hence the {{NULL}} values in that column when {{part.p_size}} is greater.
> This goes away if you keep the query above but switch the ordering of the {{COALESCE()}} arguments.
> I can see the same sort of problems if I write similar {{RIGHT}} or {{FULL OUTER JOIN}} queries:
> {noformat}
> USE tpch_kudu;
> SELECT
> DISTINCT
> COALESCE(a2.n_nationkey, a1.p_size),
> a2.n_nationkey,
> a1.p_size
> FROM part a1
> FULL OUTER JOIN nation a2 ON (a1.p_size) = (a2.n_nationkey)
> ORDER BY 1,2,3;
> {noformat}
> {noformat}
> USE tpch_kudu;
> SELECT
> DISTINCT
> COALESCE(a2.n_nationkey, a1.p_size),
> a2.n_nationkey,
> a1.p_size
> FROM nation a2
> RIGHT JOIN part a1 ON (a1.p_size) = (a2.n_nationkey)
> ORDER BY 1,2,3;
> {noformat}
> Explain-level 2 plans and profiles will be attached.
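
For comparison with the rows quoted in the report, the expected {{LEFT JOIN}} output can be sketched in plain Python. This is an illustration of standard COALESCE semantics only, not Impala code: `coalesce` and `left_join_rows` are hypothetical helpers, and the sketch assumes each nation key is unique, so a part row matches at most one nation row.

```python
from typing import Iterable, List, Optional, Tuple


def coalesce(*values: Optional[int]) -> Optional[int]:
    """Return the first argument that is not None, mirroring SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def left_join_rows(
    p_sizes: Iterable[int], nation_keys: Iterable[int]
) -> List[Tuple[Optional[int], Optional[int], int]]:
    """Emit (coalesce(n_nationkey, p_size), n_nationkey, p_size) per part row.

    Assumes nation keys are unique, so each p_size has at most one match.
    """
    keys = set(nation_keys)
    rows = []
    for p_size in p_sizes:
        # An unmatched part row gets NULL (None) for the nation column.
        n_nationkey = p_size if p_size in keys else None
        rows.append((coalesce(n_nationkey, p_size), n_nationkey, p_size))
    return rows


# Nation keys 0..24, as described in the report; a few p_size values around 24.
expected = left_join_rows([23, 24, 25, 26], range(0, 25))
for row in expected:
    print(row)
```

Under these assumptions the first column is never NULL for an unmatched row: it falls back to {{p_size}}, which is the behavior the report describes for HDFS tables and Postgres.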



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
