From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM
Date Fri, 02 Dec 2011 15:09:40 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161680#comment-13161680 ]

Robert Joseph Evans commented on MAPREDUCE-3460:

I don't know for sure yet whether the test simulates the situation, but yesterday, before
I left, one of the tests we were running got into this state and I was able to poke around
a little bit.  I have the complete set of AM and RM logs for that period, and I am
walking through them now to understand exactly what happened and to try to reproduce it.

From what I have seen so far, the following is the sequence of events:
2011-12-01 19:05:48,480 ASSIGNED CONTAINER container_1322524316055_0237_01_000002 TO HOST
2011-12-01 19:05:48,483 ASSIGNED CONTAINER container_1322524316055_0237_01_000003 TO HOST
2011-12-01 19:05:50,469 ASSIGNED CONTAINER container_1322524316055_0237_01_000002 TO ATTEMPT
2011-12-01 19:05:50,476 ASSIGNED CONTAINER container_1322524316055_0237_01_000003 TO ATTEMPT
2011-12-01 19:06:11,541 ASSIGNED CONTAINER container_1322524316055_0237_01_000004 TO HOST
2011-12-01 19:06:11,542 ASSIGNED CONTAINER container_1322524316055_0237_01_000005 TO HOST
2011-12-01 19:06:12,539 ATTEMPT attempt_1322524316055_0237_m_000000_0 FAILED
2011-12-01 19:06:12,540 ATTEMPT attempt_1322524316055_0237_m_000001_0 FAILED
2011-12-01 19:06:12,545 ASSIGNED CONTAINER container_1322524316055_0237_01_000004 TO ATTEMPT
2011-12-01 19:06:12,555 ASSIGNED CONTAINER container_1322524316055_0237_01_000005 TO ATTEMPT
2011-12-01 19:06:12,573 1 FAILURES ON H2
2011-12-01 19:06:12,574 2 FAILURES ON H2
2011-12-01 19:06:20,573 ASSIGNED CONTAINER container_1322524316055_0237_01_000006 TO HOST
2011-12-01 19:06:20,574 ASSIGNED CONTAINER container_1322524316055_0237_01_000007 TO HOST
2011-12-01 19:06:20,585 ATTEMPT attempt_1322524316055_0237_m_000002_0 FAILED
2011-12-01 19:06:20,586 ATTEMPT attempt_1322524316055_0237_m_000003_0 FAILED
2011-12-01 19:06:20,589 ASSIGNED CONTAINER container_1322524316055_0237_01_000006 TO ATTEMPT
2011-12-01 19:06:20,592 ASSIGNED CONTAINER container_1322524316055_0237_01_000007 TO ATTEMPT
2011-12-01 19:06:20,605 3 FAILURES ON H2
2011-12-01 19:06:20,607 4 FAILURES ON H2
2011-12-01 19:06:20,608 BLACKLISTED H2
2011-12-01 19:06:23,998 ASSIGNED CONTAINER container_1322524316055_0237_01_000008 TO HOST
2011-12-01 19:06:23,999 ASSIGNED CONTAINER container_1322524316055_0237_01_000009 TO HOST
2011-12-01 19:06:26,647 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO HOST
2011-12-01 19:06:26,649 ASSIGNED CONTAINER container_1322524316055_0237_01_000011 TO HOST
2011-12-01 19:06:28,635 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO ATTEMPT
2011-12-01 19:06:28,640 ASSIGNED CONTAINER container_1322524316055_0237_01_000011 TO ATTEMPT
2011-12-01 19:06:40,839 ASSIGNED CONTAINER container_1322524316055_0237_01_000012 TO HOST
2011-12-01 19:06:40,840 ASSIGNED CONTAINER container_1322524316055_0237_01_000013 TO HOST
2011-12-01 19:06:42,675 ASSIGNED CONTAINER container_1322524316055_0237_01_000012 TO ATTEMPT
2011-12-01 19:06:42,682 ASSIGNED CONTAINER container_1322524316055_0237_01_000013 TO ATTEMPT
2011-12-01 19:06:45,698 ASSIGNED CONTAINER container_1322524316055_0237_01_000014 TO HOST
2011-12-01 19:06:45,699 ASSIGNED CONTAINER container_1322524316055_0237_01_000015 TO HOST
2011-12-01 19:06:46,698 ASSIGNED CONTAINER container_1322524316055_0237_01_000014 TO ATTEMPT
2011-12-01 19:06:46,703 ASSIGNED CONTAINER container_1322524316055_0237_01_000015 TO ATTEMPT

After that, it looks like the scheduler has several requested containers to assign, but it never
assigns any of them, and the AM never asks for anything new.
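
One way to cross-check the event list above is to pair up the TO HOST and TO ATTEMPT
lines mechanically. The sketch below is a hypothetical helper (the class and method names
are mine, not anything in Hadoop) and it assumes the summarized line format used in this
comment rather than the raw AM/RM log format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: scans event summaries in the
// format used in this comment and reports containers that were assigned
// to a host but never assigned to a task attempt.
class UnassignedContainerScan {

    static List<String> assignedToHostOnly(List<String> events) {
        Set<String> toHost = new LinkedHashSet<>();
        Set<String> toAttempt = new LinkedHashSet<>();
        for (String line : events) {
            String[] f = line.trim().split("\\s+");
            // Expected shape: <date> <time> ASSIGNED CONTAINER <id> TO HOST|ATTEMPT
            if (f.length == 7 && f[2].equals("ASSIGNED")
                    && f[3].equals("CONTAINER") && f[5].equals("TO")) {
                if (f[6].equals("HOST")) {
                    toHost.add(f[4]);
                } else if (f[6].equals("ATTEMPT")) {
                    toAttempt.add(f[4]);
                }
            }
            // Every other event type (FAILED, FAILURES ON, BLACKLISTED) is ignored.
        }
        List<String> result = new ArrayList<>(toHost);
        result.removeAll(toAttempt);
        return result;
    }

    public static void main(String[] args) {
        // A few of the events listed above, starting at the blacklisting of H2.
        List<String> events = Arrays.asList(
            "2011-12-01 19:06:20,608 BLACKLISTED H2",
            "2011-12-01 19:06:23,998 ASSIGNED CONTAINER container_1322524316055_0237_01_000008 TO HOST",
            "2011-12-01 19:06:23,999 ASSIGNED CONTAINER container_1322524316055_0237_01_000009 TO HOST",
            "2011-12-01 19:06:26,647 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO HOST",
            "2011-12-01 19:06:28,635 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO ATTEMPT");
        System.out.println(assignedToHostOnly(events));
    }
}
```

Run over the full list above, the only containers with a TO HOST line and no TO ATTEMPT
line are 000008 and 000009, both assigned right after H2 was blacklisted. Whether those
two are what triggers the hang is not established by this check alone.
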
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it
> has blacklisted, it tries to find a corresponding container request.
> This uses the hostname to find the matching container request, and can end up returning
> any of the ContainerRequests which may have requested a container on this node. This container
> request is cleaned to remove the bad node, and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat. The RM Allocator is still aware of
> the priority=5 container (in 'remoteRequestsTable'), but this never gets added back to the
> 'ask' set, which is what is sent to the RM.
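
The state described in the issue can be illustrated with a toy model. This is only a
sketch under stated assumptions, not the actual AM code: the class AskListSketch and its
methods are hypothetical, the value 20 for ordinary map priority is assumed, the
host-based lookup is reduced to "first request registered for that host", and both
requests are registered against H2 purely to make that lookup ambiguous. Only the names
'ask' and 'remoteRequestsTable' and the priority 5 come from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the bookkeeping described in the issue; NOT the real AM code.
class AskListSketch {
    static final int PRIORITY_FAILED_MAP = 5; // from the issue description
    static final int PRIORITY_MAP = 20;       // assumed value for ordinary maps

    // Everything the AM still wants, keyed by priority (count of requests).
    final Map<Integer, Integer> remoteRequestsTable = new HashMap<>();
    // Priorities requested per host, in registration order (hypothetical index).
    final Map<String, List<Integer>> requestsByHost = new HashMap<>();
    // What will be sent to the RM on the next heartbeat.
    final Set<Integer> ask = new TreeSet<>();

    void addRequest(int priority, String host) {
        remoteRequestsTable.merge(priority, 1, Integer::sum);
        requestsByHost.computeIfAbsent(host, h -> new ArrayList<>()).add(priority);
        ask.add(priority);
    }

    // Returns a snapshot of what was sent, then clears 'ask' as the issue describes.
    Set<Integer> heartbeat() {
        Set<Integer> sent = new TreeSet<>(ask);
        ask.clear();
        return sent;
    }

    // A container arrived on a blacklisted host. The lookup uses only the host,
    // so it may match a request of a different priority than the container's.
    int reAskAfterBlacklistedAllocation(String host) {
        int matched = requestsByHost.get(host).get(0);
        ask.add(matched);
        return matched;
    }

    public static void main(String[] args) {
        AskListSketch am = new AskListSketch();
        am.addRequest(PRIORITY_MAP, "H2");
        am.addRequest(PRIORITY_FAILED_MAP, "H2");
        System.out.println("sent: " + am.heartbeat());
        // RM hands back a priority-5 container on blacklisted H2; the AM rejects it.
        System.out.println("re-asked priority: " + am.reAskAfterBlacklistedAllocation("H2"));
        System.out.println("sent: " + am.heartbeat());
        System.out.println("still wanted: " + am.remoteRequestsTable.keySet());
    }
}
```

In this model the second heartbeat carries only the priority-20 request, while the
priority-5 entry stays in remoteRequestsTable and is never sent again, which matches the
symptom of the scheduler having nothing new to assign for it.
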

