hive-issues mailing list archives

From "Chao Sun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15489) Alternatively use table scan stats for HoS
Date Wed, 08 Feb 2017 23:05:41 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858665#comment-15858665 ]

Chao Sun commented on HIVE-15489:
---------------------------------

One issue with the current approach is that the JOIN operator we are looking at could be
affected by upstream joins or aggregations:

{code}
      M1   M2
       \  /
(JOIN 1) R1     M3
         \     /
          \   R2
           \ /
            R3 (JOIN 2)
{code}
Here there are multiple reduce phases before reaching {{JOIN 2}}, and each of them could
change the data size significantly.
To minimize this inaccuracy, I propose that *we use TS (TableScan) stats only if there is no
RS (ReduceSink) between the JOIN and any of the roots reachable from it.*
In the above, {{JOIN 1}} satisfies the condition while {{JOIN 2}} does not.
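
As a rough illustration of the proposed check (not the actual Hive code), here is a minimal Python sketch that models the diagram as a graph of stages. The {{Stage}} class and the {{upstream}} and {{can_use_ts_stats}} helpers are hypothetical names made up for this example. It also assumes that an intermediate reduce stage upstream of the JOIN stands in for an RS boundary on the path to the roots, so the shuffle that feeds the JOIN itself is not counted.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical stand-in for one stage in the diagram (M1, R1, ...)."""
    name: str
    kind: str  # "map" for root stages that scan tables, "reduce" otherwise
    parents: list = field(default_factory=list)


def upstream(stage):
    """Return every stage reachable from `stage` by following parent edges."""
    seen = {}
    stack = list(stage.parents)
    while stack:
        current = stack.pop()
        if current.name not in seen:
            seen[current.name] = current
            stack.extend(current.parents)
    return list(seen.values())


def can_use_ts_stats(join_stage):
    """True only if no reduce stage sits between the join and its roots.

    Assumption: a reduce stage upstream of the join represents an RS
    boundary whose output size the table scan stats cannot predict.
    """
    return all(s.kind == "map" for s in upstream(join_stage))


# The stage graph from the diagram above.
m1, m2, m3 = Stage("M1", "map"), Stage("M2", "map"), Stage("M3", "map")
r1 = Stage("R1", "reduce", [m1, m2])  # JOIN 1
r2 = Stage("R2", "reduce", [m3])
r3 = Stage("R3", "reduce", [r1, r2])  # JOIN 2

print("JOIN 1:", can_use_ts_stats(r1))
print("JOIN 2:", can_use_ts_stats(r3))
```

With this model, {{JOIN 1}} passes because everything upstream of R1 is a map stage, while {{JOIN 2}} fails because R1 and R2 lie between R3 and the roots.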

> Alternatively use table scan stats for HoS
> ------------------------------------------
>
>                 Key: HIVE-15489
>                 URL: https://issues.apache.org/jira/browse/HIVE-15489
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark, Statistics
>    Affects Versions: 2.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-15489.1.patch, HIVE-15489.2.patch, HIVE-15489.wip.patch
>
>
> For MapJoin in HoS, we should provide an option to use only the stats in the TS rather than
> the populated stats in each of the join branches. This could be pretty conservative but more
> reliable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
