hadoop-mapreduce-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5043) Fetch failure processing can cause AM event queue to backup and eventually OOM
Date Sat, 02 Mar 2013 22:33:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591550#comment-13591550 ]

Hadoop QA commented on MAPREDUCE-5043:

{color:green}+1 overall{color}.  Here are the results of testing the latest attachment 
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 2 new or modified test files.

    {color:green}+1 tests included appear to have a timeout.{color}

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3377//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3377//console

This message is automatically generated.
> Fetch failure processing can cause AM event queue to backup and eventually OOM
> ------------------------------------------------------------------------------
>                 Key: MAPREDUCE-5043
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5043
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.7, 2.0.4-beta
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-5043.patch
> Saw an MRAppMaster with a 3G heap hit an OOM.  Upon investigating another instance of it
that was still running, we saw the UI in an inconsistent state: the task table and the task
attempt tables on the job overview page didn't agree.  The AM log showed the AsyncDispatcher
had hundreds of thousands of events in the event queue, and jstacks showed it spending a lot
of time in fetch failure processing.  It turns out fetch failure processing is currently
*very* expensive, with a triple {{for}} loop where the inner loop calls the quite-expensive
{{TaskAttempt.getReport}}.  That function ends up type-converting the entire task report,
counters and all, and performing locale conversions among other things.  It does this for
every reduce task in the job, for every map task that failed.  And when it's done building
up the large task report, it pulls out one field, the phase, then throws the report away.
> While the AM is busy processing fetch failures, task attempts continue to send events to
the AM, including memory-expensive events such as status updates, which include the counters.
These back up in the AsyncDispatcher event queue, and eventually even an AM with a large heap
will run out of memory and crash, or expire because it thrashes in garbage collection.
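
The cost pattern described above (build a large report, read one field, discard the rest) can be illustrated with a minimal sketch. This is not the Hadoop code and not the attached patch: the Attempt, Report, and Phase types below are hypothetical stand-ins, the counter copy only models the expensive conversion, and the choice of SHUFFLE as the phase being checked is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT the real Hadoop classes.
class FetchFailureSketch {

    enum Phase { SHUFFLE, SORT, REDUCE }

    /** Simplified report; the copied counters model the costly conversion. */
    static final class Report {
        final Phase phase;
        final Map<String, Long> counters;

        Report(Phase phase, Map<String, Long> counters) {
            this.phase = phase;
            this.counters = counters;
        }
    }

    static final class Attempt {
        private final Phase phase;
        private final Map<String, Long> counters = new HashMap<>();
        int reportsBuilt = 0; // how many times the expensive path ran

        Attempt(Phase phase, int counterCount) {
            this.phase = phase;
            for (int i = 0; i < counterCount; i++) {
                counters.put("counter-" + i, (long) i);
            }
        }

        /** Expensive path: copies every counter to build a full report. */
        Report getReport() {
            reportsBuilt++;
            return new Report(phase, new HashMap<>(counters));
        }

        /** Cheap path (assumed accessor): returns only the field needed. */
        Phase getPhase() {
            return phase;
        }
    }

    /** Pattern from the description: full report built, one field read. */
    static int countShufflingViaReport(List<Attempt> reducers) {
        int shuffling = 0;
        for (Attempt a : reducers) {
            if (a.getReport().phase == Phase.SHUFFLE) {
                shuffling++;
            }
        }
        return shuffling;
    }

    /** Same answer without allocating a report per attempt. */
    static int countShufflingViaPhase(List<Attempt> reducers) {
        int shuffling = 0;
        for (Attempt a : reducers) {
            if (a.getPhase() == Phase.SHUFFLE) {
                shuffling++;
            }
        }
        return shuffling;
    }

    public static void main(String[] args) {
        List<Attempt> reducers = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            reducers.add(new Attempt(i % 2 == 0 ? Phase.SHUFFLE : Phase.REDUCE, 50));
        }
        // Both calls return the same count; only the first copies counters.
        System.out.println(countShufflingViaReport(reducers));
        System.out.println(countShufflingViaPhase(reducers));
    }
}
```

Both methods return the same count. The difference is that the report-based version allocates and fills a counter map for every attempt on every pass, which is the kind of per-iteration work that becomes costly when repeated for every reduce task and every failed map task.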

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
