hadoop-hive-dev mailing list archives

From "Arvind Prabhakar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
Date Wed, 09 Jun 2010 22:27:13 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877239#action_12877239 ]

Arvind Prabhakar commented on HIVE-1139:

Ashish - no problem - let me explain: The problem being addressed by this JIRA is that {{GroupByOperator}}
(and possibly other aggregation operators) uses in-memory maps to store intermediate keys, which
can lead to an {{OutOfMemoryError}} when the number of such keys is large. One suggested
workaround is to use the {{HashMapWrapper}} class, which alleviates the memory pressure
by spilling the excess data to disk.
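The spill idea can be sketched as follows. This is a minimal illustration of the concept, not Hive's actual {{HashMapWrapper}}: the class and method names below are invented, and a real implementation would batch writes through a single stream and keep an index so spilled entries can be read back.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: keep at most `threshold` entries in memory and push
// overflow entries to a temp file via Java serialization. Illustrative only.
public class SpillingMap<K extends Serializable, V extends Serializable> {
    private final Map<K, V> memory = new HashMap<>();
    private final int threshold;
    private final File spillFile;
    private int spilledCount = 0;

    public SpillingMap(int threshold) throws IOException {
        this.threshold = threshold;
        this.spillFile = File.createTempFile("spill", ".bin");
        this.spillFile.deleteOnExit();
    }

    public void put(K key, V value) throws IOException {
        if (memory.size() < threshold) {
            memory.put(key, value);
        } else {
            // Append the overflow entry to disk instead of growing the map.
            // (A real implementation would reuse one stream; opening a new
            // ObjectOutputStream per entry writes a fresh stream header.)
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new FileOutputStream(spillFile, true))) {
                out.writeObject(key);
                out.writeObject(value);
                spilledCount++;
            }
        }
    }

    public int inMemorySize() { return memory.size(); }
    public int spilledSize()  { return spilledCount; }
}
```

Note that this only works if the keys and values are Java-serializable, which is exactly the limitation discussed below.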

The {{HashMapWrapper}}, however, uses Java serialization to write out the excess data. This
does not work when the data contains objects that are not Java-serializable, such as {{Writable}}
types ({{Text}}, etc.). What I have done so far is modify {{HashMapWrapper}} to support the full
{{java.util.Map}} interface. However, when I tried updating {{GroupByOperator}} to use
it, I ran into the said serialization problem.
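The failure mode is easy to reproduce without Hadoop on the classpath. The sketch below uses a stand-in for the Writable contract (the real interface is {{org.apache.hadoop.io.Writable}}; the types here are invented for illustration): {{ObjectOutputStream}} rejects such an object with {{NotSerializableException}}, while round-tripping through {{write}}/{{readFields}} works fine.

```java
import java.io.*;

// Stand-in for Hadoop's Writable contract (assumed shape, for illustration).
// Writable types serialize via write/readFields, not java.io.Serializable.
interface FakeWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Stand-in for org.apache.hadoop.io.Text.
class FakeText implements FakeWritable {
    private String value = "";
    FakeText() {}
    FakeText(String v) { value = v; }
    public void write(DataOutput out) throws IOException { out.writeUTF(value); }
    public void readFields(DataInput in) throws IOException { value = in.readUTF(); }
    public String get() { return value; }
}

public class WritableDemo {
    // Java serialization fails because FakeText does not implement Serializable.
    static boolean javaSerializes(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {  // NotSerializableException is an IOException
            return false;
        }
    }

    // Spilling code must instead round-trip through the Writable contract.
    static FakeText roundTrip(FakeText t) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        t.write(new DataOutputStream(buf));
        FakeText copy = new FakeText();
        copy.readFields(new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray())));
        return copy;
    }
}
```

This is why decoupling the map enhancement from the serialization question seems attractive: the spill path needs a pluggable serialization mechanism before {{GroupByOperator}} can use it.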

That's why I was suggesting that perhaps we should decouple the serialization problem from
enhancing the {{HashMapWrapper}} and let the latter be checked in independently.

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
> When a partial aggregation is performed on a mapper, a HashMap is created to keep all
> distinct keys in main memory. This can lead to an OOM exception when there are too many
> distinct keys for a particular mapper. A workaround is to set the map split size smaller so
> that each mapper processes fewer rows. A better solution is to use the persistent
> HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk.
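For reference, the split-size workaround mentioned in the description amounts to session-level settings along these lines (property names as used by the Hadoop MapReduce of that era; the value shown is illustrative, and the effective property can depend on the input format in use):

```sql
-- Shrink the maximum input split so each mapper sees fewer rows,
-- and therefore fewer distinct group-by keys. Value is illustrative.
set mapred.max.split.size=16000000;
```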

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
