hive-issues mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off
Date Wed, 25 Nov 2015 01:55:10 GMT

    [ https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025977#comment-15025977 ]

Ashutosh Chauhan commented on HIVE-12491:
-----------------------------------------

I think what Gopal is pointing out is that the multiple-PK case is missing, and handling it
might help this use case (as demonstrated in his WIP patch).
The other problem is that we failed to recognize that, of the 3 columns, two are different
UDFs on the same column, so we computed the denominator incorrectly. Ideally we need to fix
both, but fixing at least one of the two will help.
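
To make the second point concrete, here is a minimal sketch, not the WIP patch and not Hive code. It reuses the back-off formula quoted in the issue below and compares it with a variant that keeps a single NDV per source column. The class name, the oneNdvPerSourceColumn helper, and the "keep the largest NDV per source column" policy are assumptions made for illustration only; the source column names and NDVs are taken from the plan and the list in the issue, with the pairing of NDVs to columns assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EasedOutDenominatorSketch {

  // Same exponential back-off as the snippet quoted in the issue:
  // denominator = NDV1 * NDV2^(1/2) * NDV3^(1/4) * ... over NDVs sorted descending.
  static long easedOutDenominator(List<Long> distinctVals) {
    List<Long> sorted = new ArrayList<>(distinctVals); // do not mutate the caller's list
    Collections.sort(sorted, Collections.reverseOrder());
    long denom = sorted.get(0);
    for (int i = 1; i < sorted.size(); i++) {
      denom = (long) (denom * Math.pow(sorted.get(i), 1.0 / (1 << i)));
    }
    return denom;
  }

  // Hypothetical helper (an assumption, not an existing Hive method): collapse
  // join keys that are expressions over the same source column into one NDV,
  // keeping the largest value seen for that column.
  static List<Long> oneNdvPerSourceColumn(List<String> sourceColumns, List<Long> ndvs) {
    Map<String, Long> bySource = new LinkedHashMap<>();
    for (int i = 0; i < ndvs.size(); i++) {
      bySource.merge(sourceColumns.get(i), ndvs.get(i), Long::max);
    }
    return new ArrayList<>(bySource.values());
  }

  public static void main(String[] args) {
    // Join keys from the plan: _col0, year(_col2), month(_col2).
    List<String> sources = Arrays.asList("_col0", "_col2", "_col2");
    List<Long> ndvs = Arrays.asList(8007986L, 821974390L, 821974390L);

    long naive = easedOutDenominator(ndvs);
    long deduped = easedOutDenominator(oneNdvPerSourceColumn(sources, ndvs));

    System.out.println("naive denominator   = " + naive);
    System.out.println("deduped denominator = " + deduped);
  }
}
```

The sketch only shows the direction of the effect: counting the same source column twice inflates the denominator, which shrinks the estimated join output. It makes no claim about the exact numbers the real fix would produce.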

> Column Statistics: 3 attribute join on a 2-source table is off
> --------------------------------------------------------------
>
>                 Key: HIVE-12491
>                 URL: https://issues.apache.org/jira/browse/HIVE-12491
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.3.0, 2.0.0
>            Reporter: Gopal V
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-12491.WIP.patch
>
>
> The eased out denominator has to detect duplicate row-stats from different attributes.
> {code}
>   private Long getEasedOutDenominator(List<Long> distinctVals) {
>     // Exponential back-off for NDVs.
>     // 1) Descending order sort of NDVs
>     // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ....
>     Collections.sort(distinctVals, Collections.reverseOrder());
>     long denom = distinctVals.get(0);
>     for (int i = 1; i < distinctVals.size(); i++) {
>       denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
>     }
>     return denom;
>   }
> {code}
> This gets {{[8007986, 821974390, 821974390]}}, which is actually 3 columns, 2 of which are derived from the same column.
> {code}
>         Reduce Output Operator (RS_12)
>           key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>           sort order: +++
>           Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>           value expressions: _col1 (type: bigint)
>           Join Operator (JOIN_13)
>             condition map:
>                  Inner Join 0 to 1
>             keys:
>               0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
>               1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>             outputColumnNames: _col3
> {code}
> So the eased out denominator is off by a factor of 30,000 or so, causing OOMs in map-joins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
