hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation
Date Tue, 02 Jun 2009 22:34:07 GMT

    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715727#action_12715727
] 

Zheng Shao commented on HIVE-535:
---------------------------------

Some details about A3:
We use a special hashmap that the value part has to be a primitive int (4 bytes). We let he
UDAFs manage the array of aggregation results, and store the index of the aggregation results
in the hashTable.
{code}
  SELECT departmentid, count(1), sum(revenue)
  FROM sales
  GROUP BY departmentid;

  on a new departmentid:
  int key_id = -1;
  if (!hashmap.contains(departmentid)) {
    key_id = hashmap.size() + 1;
    hashmap.insert(departmentid, key_id);
    count_1.initAggregation(key_id);
    sum_revenue.initAggregation(key_id);
  } else {
    key_id = hashmap.get(departmentid);
  }
  count_1.iterate(key_id, 1);
  sum_revenue.iterate(key_id, revenue);

{code}


> Memory-efficient hash-based Aggregation
> ---------------------------------------
>
>                 Key: HIVE-535
>                 URL: https://issues.apache.org/jira/browse/HIVE-535
>             Project: Hadoop Hive
>          Issue Type: Improvement
>    Affects Versions: 0.4.0
>            Reporter: Zheng Shao
>
> Currently there are a lot of memory overhead in the hash-based aggregation in GroupByOperator.
> The net result is that GroupByOperator won't be able to store many entries in its HashTable,
and flushes frequently, and won't be able to achieve very good partial aggregation result.
> Here are some initial thoughts (some of them are from Joydeep long time ago):
> A1. Serialize the key of the HashTable. This will eliminate the 16-byte per-object overhead
of Java in keys (depending on how many objects there are in the key, the saving can be substantial).
> A2. Use more memory-efficient hash tables - java.util.HashMap has about 64 bytes of overhead
per entry.
> A3. Use primitive array to store aggregation results. Basically, the UDAF should manage
the array of aggregation results, so UDAFCount should manage a long[], UDAFAvg should manage
a double[] and a long[]. The external code should pass an index to iterate/merge/terminal
an aggregation result. This will eliminate the 16-byte per-object overhead of Java.
> More ideas are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message