hive-issues mailing list archives

From "Chao Sun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15477) Provide options to adjust filter stats when column stats are not available
Date Wed, 21 Dec 2016 21:12:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768176#comment-15768176 ]

Chao Sun commented on HIVE-15477:
---------------------------------

[~prasanth_j], can you elaborate on what kind of mis-estimate the "join_key_column IS
NOT NULL" predicates can cause? I'm also curious why they were added to Hive. I was looking
at {{evaluateNotNullExpr}}, but it seems to just return the input # of rows when column stats
are not present (see
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L586).

Yeah, I totally agree that we will still make wrong estimates even with configs. It's very
difficult to get 100% accurate stats. But with some configs we can at least allow some manual
intervention. :)


> Provide options to adjust filter stats when column stats are not available
> --------------------------------------------------------------------------
>
>                 Key: HIVE-15477
>                 URL: https://issues.apache.org/jira/browse/HIVE-15477
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 2.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-15477.1.patch
>
>
> Currently when column stats are not available, Hive will assume the "worst" case by setting
> the # of output rows to be 1/2 of the # of input rows, for each predicate expression. This
> could be inaccurate, especially in the presence of multiple predicates chained by AND. We
> have found in some cases this could cause map join to have wrong ordering and thus fail with
> memory issue.
> One suggestion is to provide a config (such as {{hive.stats.filter.factor}}) that can
> be used to control the percentage of rows emitted by a predicate expression.
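
To illustrate how the per-predicate 1/2 factor in the description compounds across predicates chained by AND, here is a minimal sketch in Python. It is not Hive's actual code: the function {{estimate_filter_rows}} and its {{factor}} parameter are hypothetical, with {{factor}} standing in for the proposed {{hive.stats.filter.factor}} config, and it assumes the factor is simply applied once per predicate.

```python
def estimate_filter_rows(input_rows: int, num_predicates: int, factor: float = 0.5) -> int:
    """Estimate output rows when each AND-ed predicate keeps `factor` of its input.

    Hypothetical helper for illustration only; `factor` plays the role of the
    proposed hive.stats.filter.factor setting (0.5 mirrors the current 1/2).
    """
    if input_rows < 0 or num_predicates < 0:
        raise ValueError("input_rows and num_predicates must be non-negative")
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be between 0 and 1")

    rows = float(input_rows)
    # Apply the same selectivity once per predicate in the AND chain.
    for _ in range(num_predicates):
        rows *= factor
    return round(rows)


if __name__ == "__main__":
    # With the default factor, three AND-ed predicates shrink 1,000,000 rows to 125,000.
    print(estimate_filter_rows(1_000_000, 3))
    # A factor of 1.0 leaves the row count unchanged.
    print(estimate_filter_rows(1_000_000, 3, factor=1.0))
```

Under these assumptions the estimate shrinks geometrically with the number of predicates, so a table whose filters in fact keep most rows could look far smaller than it is. That is the kind of underestimate the description links to wrong map join ordering.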



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
