hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
Date Tue, 15 Jun 2010 18:32:24 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879061#action_12879061

Ning Zhang commented on HIVE-1139:

I'm not aware of an efficient SerDe that is reflection-based. The XMLEncoder/Decoder is JAXB-based
and is very inefficient. That's another reason we don't want to use it in the execution code
path. A better way is to use a Hive SerDe (e.g., LazyBinarySerDe) just as it is done in

Another way to tackle this problem is to estimate more accurately how many rows fit
into main memory. The current code checks the amount of available memory and uses a fraction
of it (0.25 by default) to hold the hash map. We set it to 0.15 in our environment, and that works
for most cases, although 0.15 is probably a little conservative. Some experiments are needed
to tune this parameter so that most cases fit in main memory and secondary storage is used
only in the exceptional cases.
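
As a rough illustration of the estimate discussed above, the sketch below turns a heap size, a memory fraction, and a per-entry size into a row budget for the hash map. It is not the Hive code: the class and method names are hypothetical, and the bytes-per-entry figure is an assumed input that would have to be measured for real keys and aggregation buffers.

```java
/** Hypothetical helper, not part of Hive: sizes a hash map from a memory fraction. */
public final class HashMemoryEstimate {

    private HashMemoryEstimate() {}

    /**
     * Returns how many hash-map entries fit in the given fraction of the heap.
     *
     * @param maxHeapBytes           total heap available to the task, in bytes
     * @param memoryFraction         share of the heap reserved for the hash map (0 < f <= 1)
     * @param estimatedBytesPerEntry assumed average size of one key plus its aggregation state
     */
    public static long maxEntries(long maxHeapBytes, double memoryFraction,
                                  long estimatedBytesPerEntry) {
        if (memoryFraction <= 0.0 || memoryFraction > 1.0) {
            throw new IllegalArgumentException("memoryFraction must be in (0, 1]");
        }
        if (estimatedBytesPerEntry <= 0) {
            throw new IllegalArgumentException("estimatedBytesPerEntry must be positive");
        }
        long budgetBytes = (long) (maxHeapBytes * memoryFraction);
        return budgetBytes / estimatedBytesPerEntry;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        long assumedEntrySize = 200L; // assumption: 200 bytes per distinct key
        // Compare the default fraction with the more conservative one from the comment.
        System.out.println("fraction 0.25: " + maxEntries(maxHeap, 0.25, assumedEntrySize));
        System.out.println("fraction 0.15: " + maxEntries(maxHeap, 0.15, assumedEntrySize));
    }
}
```

The result is only as good as the per-entry size estimate, which is why the fraction still needs tuning by experiment.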

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.5.0
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>         Attachments: PersistentMap.zip
> When partial aggregation is performed on a mapper, a HashMap is created to keep all distinct
keys in main memory. This can lead to an OOM exception when there are too many distinct keys
for a particular mapper. A workaround is to set the map split size smaller so that each mapper
processes fewer rows. A better solution is to use the persistent HashMapWrapper (currently
used in CommonJoinOperator) to spill overflow rows to disk.
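
To make the spill-to-disk idea in the description concrete, here is a minimal sketch of a bounded in-memory aggregation that writes overflow rows to a temporary file. It is not Hive's HashMapWrapper and does not use Hive's SerDe classes: the SpillingCounter class, its methods, and the use of writeUTF for serialization are all assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: counts rows per key and spills rows for new keys once the map is full. */
public final class SpillingCounter implements AutoCloseable {
    private final int maxInMemoryKeys;
    private final Map<String, Long> counts = new HashMap<>();
    private final Path spillFile;
    private final DataOutputStream spillOut;
    private long spilledRows = 0;

    public SpillingCounter(int maxInMemoryKeys) {
        this.maxInMemoryKeys = maxInMemoryKeys;
        try {
            this.spillFile = Files.createTempFile("groupby-spill", ".bin");
            this.spillOut = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(spillFile)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Aggregates in memory while there is room; otherwise appends the row's key to disk. */
    public void add(String key) {
        Long current = counts.get(key);
        if (current != null) {
            counts.put(key, current + 1);   // key already resident: aggregate in place
        } else if (counts.size() < maxInMemoryKeys) {
            counts.put(key, 1L);            // room for one more distinct key
        } else {
            try {
                spillOut.writeUTF(key);     // map is full: spill the overflow row
                spilledRows++;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public Map<String, Long> inMemoryCounts() { return counts; }

    public long spilledRows() { return spilledRows; }

    /** Reads the spilled keys back in write order; a real operator would stream them instead. */
    public List<String> readSpilledKeys() {
        List<String> keys = new ArrayList<>();
        try {
            spillOut.flush();
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(spillFile)))) {
                while (true) {
                    keys.add(in.readUTF());
                }
            } catch (EOFException endOfFile) {
                // Normal termination: every spilled key has been read.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }

    @Override
    public void close() {
        try {
            spillOut.close();
            Files.deleteIfExists(spillFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The design choice here is that memory use is bounded by the number of resident keys, at the cost of a second pass over the spilled rows; how that second pass is aggregated is left open in this sketch.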

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
