hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
Date Wed, 17 Oct 2012 23:48:03 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478496#comment-13478496 ]

Jason Lowe commented on MAPREDUCE-4730:

Here's what I have gathered so far from a heap dump of an AM attempt that was just about to
run out of memory.  Most of the memory was consumed by byte buffers, and most of those
appeared to be RPC response buffers.

I think there might be a flow control issue in the IPC layer that led to this.  More than
half of the mappers finished before the first reducer started, and thousands of reducers all
launched within a few seconds of each other.  They all asked the AM for map task completion
events, and the AM currently caps each response at 10000 events per query.  Since more than
10000 maps completed before the first reducers started, each reducer received a full event
list, which took around 900K per response buffer.  There were many IPC Handler threads to
service the calls, but only one Responder thread to send out the rather large response
buffers.  I see there's a blocking queue to prevent too many calls from coming in at once,
but I didn't see any flow control between the Handlers and the Responder thread.  It appears
that as long as the Handler threads can keep the call queue relatively low, they can continue
to queue up call response data faster than the Responder thread can send it out.  Eventually
this will exhaust available memory, leading to an OOM.
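
To illustrate the kind of flow control that appears to be missing, here is a minimal Java
sketch.  It is not Hadoop's IPC code: the class name, method name, and parameters are all
hypothetical, and a plain thread stands in for the Responder.  Several handler threads put
response buffers on a bounded queue, so a handler blocks once the single responder falls
behind, and the number of buffered responses can never exceed the queue capacity.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of handler-to-responder backpressure; not Hadoop's IPC server.
class ResponseBackpressureSketch {
    // Sentinel that tells the responder thread to stop.
    private static final byte[] POISON = new byte[0];

    /** Returns {peak number of queued responses, total responses sent}. */
    static int[] run(int handlers, int callsPerHandler, int responseBytes, int capacity) {
        BlockingQueue<byte[]> responses = new ArrayBlockingQueue<>(capacity);
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger sent = new AtomicInteger();

        // Single responder: drains the queue one buffer at a time.
        Thread responder = new Thread(() -> {
            try {
                while (true) {
                    byte[] response = responses.take();
                    if (response == POISON) {
                        return;
                    }
                    sent.incrementAndGet(); // stand-in for writing to the socket
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        responder.start();

        ExecutorService pool = Executors.newFixedThreadPool(handlers);
        for (int h = 0; h < handlers; h++) {
            pool.execute(() -> {
                try {
                    for (int i = 0; i < callsPerHandler; i++) {
                        // put() blocks while the queue is full, which throttles the handler.
                        responses.put(new byte[responseBytes]);
                        peak.accumulateAndGet(responses.size(), Math::max);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            pool.shutdown();
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("handlers did not finish in time");
            }
            responses.put(POISON); // queued after all real responses
            responder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {peak.get(), sent.get()};
    }

    public static void main(String[] args) {
        int[] result = run(4, 50, 1024, 8);
        System.out.println("peak queued responses: " + result[0]);
        System.out.println("responses sent: " + result[1]);
    }
}
```

With an unbounded queue in place of the ArrayBlockingQueue, nothing would stop the handlers
from running ahead of the responder, which matches the buildup seen in the heap dump.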
> AM crashes due to OOM while serving up map task completion events
> -----------------------------------------------------------------
>                 Key: MAPREDUCE-4730
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4730
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>            Priority: Blocker
> We're seeing a repeatable OOM crash in the AM for a job with around 30000 maps and 3000
> reducers.  Details to follow.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
