hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation
Date Mon, 22 Feb 2010 19:22:28 GMT

    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836841#action_12836841

Zheng Shao commented on HIVE-535:

Actually colt.* has a different license. I believe it is compatible with Apache but we should
definitely get a lawyer before including it (if ever needed).

Packages cern.colt* , cern.jet*, cern.clhep

Copyright (c) 1999 CERN - European Organization for Nuclear Research.
Permission to use, copy, modify, distribute and sell this software and its documentation for
any purpose is hereby granted without fee, provided that the above copyright notice appear
in all copies and that both that copyright notice and this permission notice appear in supporting
documentation. CERN makes no representations about the suitability of this software for any
purpose. It is provided "as is" without expressed or implied warranty.

> Memory-efficient hash-based Aggregation
> ---------------------------------------
>                 Key: HIVE-535
>                 URL: https://issues.apache.org/jira/browse/HIVE-535
>             Project: Hadoop Hive
>          Issue Type: Improvement
>    Affects Versions: 0.4.0
>            Reporter: Zheng Shao
> Currently there are a lot of memory overhead in the hash-based aggregation in GroupByOperator.
> The net result is that GroupByOperator won't be able to store many entries in its HashTable,
and flushes frequently, and won't be able to achieve very good partial aggregation result.
> Here are some initial thoughts (some of them are from Joydeep long time ago):
> A1. Serialize the key of the HashTable. This will eliminate the 16-byte per-object overhead
of Java in keys (depending on how many objects there are in the key, the saving can be substantial).
> A2. Use more memory-efficient hash tables - java.util.HashMap has about 64 bytes of overhead
per entry.
> A3. Use primitive array to store aggregation results. Basically, the UDAF should manage
the array of aggregation results, so UDAFCount should manage a long[], UDAFAvg should manage
a double[] and a long[]. The external code should pass an index to iterate/merge/terminal
an aggregation result. This will eliminate the 16-byte per-object overhead of Java.
> More ideas are welcome.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message