hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM
Date Wed, 30 Nov 2011 23:15:40 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160439#comment-13160439 ]

Robert Joseph Evans commented on MAPREDUCE-3460:

I think I must be doing it wrong somehow, or I don't understand the order of things you are
requesting.  I am doing the following, and it passes on both

# request _1 on h1
# am heartbeat()
# h1 heartbeat()
# am heartbeat() //Get _1 container back
# fail _1 so h1 is blacklisted
# request _3 on h3
# request fast fail map _2 on h1
... (More heartbeats to schedule things)

This does not work to reproduce the issue because any requests for h1 added after h1 is blacklisted
will have h1 removed.
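
As a rough illustration of that filtering step, here is a minimal sketch. The class and method names (BlacklistFilterSketch, blacklist, usableHosts) are hypothetical and are not the actual MR AM code; it only models the behavior described above, where a host that is already blacklisted is dropped from any request added later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the AM-side request bookkeeping (an assumption,
// not the real Hadoop class).
class BlacklistFilterSketch {
    private final Set<String> blacklistedHosts = new HashSet<>();

    // Mark a host as blacklisted, e.g. after a task attempt fails on it.
    void blacklist(String host) {
        blacklistedHosts.add(host);
    }

    // Return the hosts a newly added request would still ask for:
    // every requested host that is not currently blacklisted.
    List<String> usableHosts(List<String> requestedHosts) {
        List<String> usable = new ArrayList<>();
        for (String host : requestedHosts) {
            if (!blacklistedHosts.contains(host)) {
                usable.add(host);
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        BlacklistFilterSketch sketch = new BlacklistFilterSketch();
        sketch.blacklist("h1");
        // The request names h1 and h3, but h1 was blacklisted first.
        System.out.println(sketch.usableHosts(Arrays.asList("h1", "h3")));
    }
}
```

In this model a request for _2 on h1 that arrives after the blacklisting never records h1 at all, which is why the ordering in the steps above cannot reproduce the hang.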

If I move the fast fail map request above h1 being blacklisted, then when the container request
comes back for h1 it sees that h1 is blacklisted.  It will not find the request in mapsHostMapping
and will resort to pulling a request out of maps, which still works.  The only way we are
going to get this deadlock is if maps is somehow empty.  I don't really see how the patch
changes that.  I really don't understand all of what the code is doing, so I could just be
completely wrong about it.
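
To make the fallback path concrete, here is a simplified sketch of how I read it. Only the names maps and mapsHostMapping come from the code under discussion; the class, the other method names, and the data structures are assumptions made for illustration, and the real allocator is more involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of the assignment path (an assumption, not the real code).
class AssignFallbackSketch {
    // Pending map attempt ids, in insertion order.
    final Set<String> maps = new LinkedHashSet<>();
    // Host name -> pending attempt ids that asked for that host.
    final Map<String, Deque<String>> mapsHostMapping = new HashMap<>();

    void addRequest(String attemptId, String... hosts) {
        maps.add(attemptId);
        for (String host : hosts) {
            mapsHostMapping.computeIfAbsent(host, h -> new ArrayDeque<>()).addLast(attemptId);
        }
    }

    // Models a blacklisted host being cleaned out of the per-host mapping.
    void forgetHost(String host) {
        mapsHostMapping.remove(host);
    }

    // Pick a pending attempt for a container allocated on the given host.
    // Returns null when nothing is pending, which is the case where the
    // container cannot be matched to any request.
    String assign(String allocatedHost) {
        Deque<String> onHost = mapsHostMapping.get(allocatedHost);
        while (onHost != null && !onHost.isEmpty()) {
            String id = onHost.removeFirst();
            if (maps.remove(id)) {
                return id; // host-local match that is still pending
            }
        }
        // Fallback: no host-local match, so take any pending map.
        Iterator<String> it = maps.iterator();
        if (it.hasNext()) {
            String id = it.next();
            it.remove();
            return id;
        }
        return null;
    }

    public static void main(String[] args) {
        AssignFallbackSketch sketch = new AssignFallbackSketch();
        sketch.addRequest("_2", "h1");
        sketch.addRequest("_3", "h3");
        sketch.forgetHost("h1");
        System.out.println(sketch.assign("h1")); // fallback still finds a pending map
    }
}
```

Under this reading the fallback only fails to hand back a request when maps is empty, which matches the condition described above.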

> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it
> has blacklisted - it tries to find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning
> any of the ContainerRequests which may have requested a container on this node. This container
> request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of
> the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the
> 'ask' set - which is what is sent to the RM.