hive-issues mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off
Date Wed, 25 Nov 2015 23:01:11 GMT

     [ https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-12491:
---------------------------
    Description: 
The eased out denominator has to detect duplicate row-stats from different attributes.

{code}
select account_id from customers c,  customer_activation ca
  where c.customer_id = ca.customer_id
  and year(ca.dt) = year(c.dt) and month(ca.dt) = month(c.dt)
  and year(ca.dt) between year('2013-12-26') and year('2013-12-26')
{code}

{code}
  private Long getEasedOutDenominator(List<Long> distinctVals) {
    // Exponential back-off for NDVs.
    // 1) Descending order sort of NDVs
    // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ....
    Collections.sort(distinctVals, Collections.reverseOrder());

    long denom = distinctVals.get(0);
    for (int i = 1; i < distinctVals.size(); i++) {
      denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
    }

    return denom;
  }
{code}

This gets {{[8007986, 821974390, 821974390]}} as its NDV list: three columns, two of which are
derived from the same column.
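
As an illustration of the duplicate detection this calls for, here is a minimal sketch that keeps one NDV per source column before the back-off is applied. It is not the attached patch: the {{KeyNdv}} class, the {{dedupBySourceColumn}} helper, and the column labels are hypothetical names introduced for this example, and it assumes the planner can say which column each join key was derived from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupNdvSketch {

  /** Hypothetical pairing of a join key's NDV with the column it was derived from. */
  public static final class KeyNdv {
    final String sourceColumn;
    final long ndv;

    public KeyNdv(String sourceColumn, long ndv) {
      this.sourceColumn = sourceColumn;
      this.ndv = ndv;
    }
  }

  /** Keeps a single NDV per source column (the largest one seen for that column). */
  public static List<Long> dedupBySourceColumn(List<KeyNdv> keys) {
    Map<String, Long> perColumn = new LinkedHashMap<>();
    for (KeyNdv key : keys) {
      perColumn.merge(key.sourceColumn, key.ndv, Long::max);
    }
    return new ArrayList<>(perColumn.values());
  }

  public static void main(String[] args) {
    // Assumed mapping: year(dt) and month(dt) both inherit their NDV from dt.
    List<KeyNdv> keys = Arrays.asList(
        new KeyNdv("customer_id", 8007986L),
        new KeyNdv("dt", 821974390L),   // year(dt)
        new KeyNdv("dt", 821974390L));  // month(dt)
    System.out.println(dedupBySourceColumn(keys));
  }
}
```

With the list from this issue, the two entries derived from {{dt}} collapse into one, so only two NDVs would reach the back-off.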

{code}
        Reduce Output Operator (RS_12)
          key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
          sort order: +++
          Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
          value expressions: _col1 (type: bigint)
          Join Operator (JOIN_13)
            condition map:
                 Inner Join 0 to 1
            keys:
              0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
              1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
            outputColumnNames: _col3
{code}

So the eased out denominator is off by a factor of 30,000 or so, causing OOMs in map-joins.
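
To make the arithmetic concrete, the sketch below reruns the back-off on the NDV list quoted in this issue and prints each term's multiplicative factor. The class name and the {{termFactors}} helper are assumptions made for this example. After the descending sort, the repeated NDV lands in the square-root slot and contributes sqrt(821974390), about 28,670, which appears to be where the "30,000 or so" comes from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EasedOutDenominatorDemo {

  /** Same back-off as getEasedOutDenominator, but sorting a copy of the input. */
  public static long easedOutDenominator(List<Long> distinctVals) {
    List<Long> sorted = new ArrayList<>(distinctVals);
    Collections.sort(sorted, Collections.reverseOrder());
    long denom = sorted.get(0);
    for (int i = 1; i < sorted.size(); i++) {
      denom = (long) (denom * Math.pow(sorted.get(i), 1.0 / (1 << i)));
    }
    return denom;
  }

  /** Factor each NDV contributes after sorting: NDV1, NDV2^(1/2), NDV3^(1/4), ... */
  public static double[] termFactors(List<Long> distinctVals) {
    List<Long> sorted = new ArrayList<>(distinctVals);
    Collections.sort(sorted, Collections.reverseOrder());
    double[] factors = new double[sorted.size()];
    for (int i = 0; i < sorted.size(); i++) {
      factors[i] = Math.pow(sorted.get(i), 1.0 / (1 << i));
    }
    return factors;
  }

  public static void main(String[] args) {
    // NDV list quoted in this issue.
    List<Long> ndvs = Arrays.asList(8007986L, 821974390L, 821974390L);
    System.out.println("denominator  = " + easedOutDenominator(ndvs));
    System.out.println("term factors = " + Arrays.toString(termFactors(ndvs)));
  }
}
```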

  was:
The eased out denominator has to detect duplicate row-stats from different attributes.

{code}
  private Long getEasedOutDenominator(List<Long> distinctVals) {
    // Exponential back-off for NDVs.
    // 1) Descending order sort of NDVs
    // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ....
    Collections.sort(distinctVals, Collections.reverseOrder());

    long denom = distinctVals.get(0);
    for (int i = 1; i < distinctVals.size(); i++) {
      denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
    }

    return denom;
  }
{code}

This gets {{[8007986, 821974390, 821974390]}}, which is actually 3 columns 2 of which are
derived from the same column.

{code}
        Reduce Output Operator (RS_12)
          key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
          sort order: +++
          Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
          value expressions: _col1 (type: bigint)
          Join Operator (JOIN_13)
            condition map:
                 Inner Join 0 to 1
            keys:
              0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
              1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
            outputColumnNames: _col3
{code}

So the eased out denominator is off by a factor of 30,000 or so, causing OOMs in map-joins.


> Column Statistics: 3 attribute join on a 2-source table is off
> --------------------------------------------------------------
>
>                 Key: HIVE-12491
>                 URL: https://issues.apache.org/jira/browse/HIVE-12491
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.3.0, 2.0.0
>            Reporter: Gopal V
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-12491.WIP.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
