hive-issues mailing list archives

From "Szehon Ho (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
Date Thu, 12 Jul 2018 16:56:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541929#comment-16541929 ]

Szehon Ho edited comment on HIVE-20153 at 7/12/18 4:55 PM:
-----------------------------------------------------------

[~aihuaxu] do you think there is some way to improve this?  (I haven't yet looked at this
code closely enough to understand it deeply.)  It seems to consume memory whether or not
it's used in a window function.

The query is something like (generalizing the table):

select count(distinct), count(), count(), count(), min(), min(), max(), max(), min(), max()
from table group by field;

I am also attaching, for reference, a heap dump of a mapper that was killed due to OOM. There are
3 million GenericUDAFCountEvaluator instances, each with a 'uniqueObjects' HashSet (each HashSet
in turn containing a HashMap).
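
To illustrate why this adds up, here is a minimal sketch of the pattern, not Hive's actual code. The class and method names (CountBufferSketch, EagerCountBuffer, LazyCountBuffer, allocatedSetsEager, allocatedSetsLazy) are hypothetical. It assumes one aggregation buffer per group, and contrasts a buffer that always allocates its set with one that allocates it only when a distinct count is requested. In the JDK, java.util.HashSet is backed by a HashMap, so every allocated set also brings a map object with it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; these classes are illustrative and are not Hive classes.
final class CountBufferSketch {

    /** Eager variant: every buffer allocates a set, even for a plain count(). */
    static final class EagerCountBuffer {
        long count;
        final Set<Object> uniqueObjects = new HashSet<>(); // one set per group, always

        void add(Object value, boolean distinct) {
            if (value == null) {
                return; // count(col) skips NULLs
            }
            if (!distinct || uniqueObjects.add(value)) {
                count++;
            }
        }
    }

    /** Lazy variant: the set is created only when a distinct count needs it. */
    static final class LazyCountBuffer {
        long count;
        Set<Object> uniqueObjects; // stays null for a plain count()

        void add(Object value, boolean distinct) {
            if (value == null) {
                return;
            }
            if (!distinct) {
                count++;
                return;
            }
            if (uniqueObjects == null) {
                uniqueObjects = new HashSet<>();
            }
            if (uniqueObjects.add(value)) {
                count++;
            }
        }
    }

    /** Number of allocated sets after a plain count() over the given number of groups. */
    static int allocatedSetsEager(int groups) {
        int allocated = 0;
        for (int g = 0; g < groups; g++) {
            EagerCountBuffer buf = new EagerCountBuffer();
            buf.add("row", false);
            if (buf.uniqueObjects != null) {
                allocated++;
            }
        }
        return allocated;
    }

    /** Same measurement for the lazy variant. */
    static int allocatedSetsLazy(int groups) {
        int allocated = 0;
        for (int g = 0; g < groups; g++) {
            LazyCountBuffer buf = new LazyCountBuffer();
            buf.add("row", false);
            if (buf.uniqueObjects != null) {
                allocated++;
            }
        }
        return allocated;
    }

    public static void main(String[] args) {
        System.out.println("eager sets: " + allocatedSetsEager(1000));
        System.out.println("lazy sets:  " + allocatedSetsLazy(1000));
    }
}
```

With millions of groups, the eager variant keeps one set (and its backing map) alive per buffer whether or not distinct is used, which would be consistent with what the heap dump shows. Whether lazy allocation is a suitable fix for the real evaluators is an open question.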

 

 

!Screen Shot 2018-07-12 at 6.41.28 PM.png!

 


was (Author: szehon):
[~aihuaxu] do you think there is some way to improve this?  (I haven't yet looked at this
code closely enough to understand it deeply.)  It seems to consume memory whether or not
it's used in a window function.

The query is something like (generalizing the table):

select count(distinct), count(), count(), count(), min(), min(), max(), max(), min(), max()
from table group by field;

I am also attaching, for reference, a heap dump of a mapper that was killed due to OOM. There are
3 million GenericUDAFCountEvaluator instances, each with a HashSet of uniqueObjects.

 

 

!Screen Shot 2018-07-12 at 6.41.28 PM.png!

 

> Count and Sum UDF consume more memory in Hive 2+
> ------------------------------------------------
>
>                 Key: HIVE-20153
>                 URL: https://issues.apache.org/jira/browse/HIVE-20153
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 2.3.2
>            Reporter: Szehon Ho
>            Priority: Major
>         Attachments: Screen Shot 2018-07-12 at 6.41.28 PM.png
>
>
> While playing with Hive2, we noticed that queries with a lot of count() and sum() aggregations
> run out of memory on the Hadoop side much faster than in Hive1.  In many queries, we have to
> double the memory.
>
> Taking a heap dump, we see that one of the main culprits is the field 'uniqueObjects' in GenericUDAFSum
> and GenericUDAFCount, which was added to support window functions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
