hadoop-mapreduce-issues mailing list archives

From "Siddharth Seth (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM
Date Thu, 01 Dec 2011 23:26:40 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161295#comment-13161295 ]

Siddharth Seth commented on MAPREDUCE-3460:
-------------------------------------------

bq. When I run the unit test above I see the hosts (NMs) are registered with the RM using "host:port",
but when we request a container in the tests the request only has "host" in it. The scheduler seems
to indicate that when it assigns a container to a host it is because it is rack local, not
data local. As part of this, the host-specific request does not seem to be cleared out from
the scheduler even though it is not part of the new ask. If I switch it over to requesting
a container on a particular "host:port" then the scheduler will find the container to
be data local, and clear out the host, rack, and * requests. This seems to work OK, but I
thought when we requested a container due to data locality we used just the host name, because
that is what HDFS returns to us.

Good catch! Like you said, the request shouldn't care about the port for data locality. The
FifoScheduler seems to be using the entire nodeAddress for allocating containers - which is
incorrect. The capacity scheduler appears to be working as it should though - using only the
hostname to allocate containers.
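
To make the host vs. "host:port" mismatch concrete, here is a minimal sketch. It is not the actual FifoScheduler or CapacityScheduler code; the class and method names (HostMatchSketch, hostOf, isDataLocal) are made up for illustration, and it assumes node addresses of the form "hostname:port".

```java
// Hypothetical sketch only; none of these names come from the Hadoop code base.
final class HostMatchSketch {

    /** Strips a trailing ":port" from an address such as "host1:1234" (assumes no IPv6 literals). */
    static String hostOf(String nodeAddress) {
        int colon = nodeAddress.lastIndexOf(':');
        return colon < 0 ? nodeAddress : nodeAddress.substring(0, colon);
    }

    /** A request is data local when the requested host equals the node's host, ignoring the port. */
    static boolean isDataLocal(String requestedHost, String nodeAddress) {
        return hostOf(nodeAddress).equals(requestedHost);
    }

    public static void main(String[] args) {
        // Comparing against the full node address never matches a host-only request.
        System.out.println("host1".equals("host1:1234"));       // prints false
        // Comparing on the hostname alone matches, as data locality expects.
        System.out.println(isDataLocal("host1", "host1:1234")); // prints true
    }
}
```

With a comparison like the first one, a request for "host1" is never treated as satisfied by a container on "host1:1234", which is consistent with the host-specific request not being cleared in the test described above.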
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it
> has blacklisted, it tries to find a corresponding container request.
> This uses the hostname to find the matching container request, and can end up returning
> any of the ContainerRequests which may have requested a container on this node. This container
> request is cleaned to remove the bad node, and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat. The RM Allocator is still aware of
> the priority=5 container (in 'remoteRequestsTable'), but this never gets added back to the
> 'ask' set, which is what is sent to the RM.
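
The bookkeeping problem in the description above can be sketched roughly as follows. This is a simplified model, not the real RM allocator code: the field names 'remoteRequestsTable' and 'ask' are borrowed from the description, while the class name, the "priority@resource" key format, and the methods are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the AM-side request bookkeeping; not the actual Hadoop classes.
final class AskListSketch {
    // Every outstanding request the allocator knows about, keyed by "priority@resource".
    final Map<String, Integer> remoteRequestsTable = new HashMap<>();
    // Only the requests added or changed since the last heartbeat; this is what the RM sees.
    final Set<String> ask = new HashSet<>();

    void addRequest(int priority, String resourceName) {
        String key = priority + "@" + resourceName;
        remoteRequestsTable.merge(key, 1, Integer::sum);
        ask.add(key);
    }

    /** Returns what would be sent to the RM, then clears the ask set. */
    Set<String> heartbeat() {
        Set<String> sent = new HashSet<>(ask);
        ask.clear();
        return sent;
    }

    public static void main(String[] args) {
        AskListSketch am = new AskListSketch();
        am.addRequest(5, "*");
        System.out.println(am.heartbeat());            // prints [5@*]
        // The table still holds the priority=5 request, but nothing re-adds it to 'ask'.
        System.out.println(am.heartbeat());            // prints []
        System.out.println(am.remoteRequestsTable);    // prints {5@*=1}
    }
}
```

In this model, once a heartbeat has cleared 'ask', a request that lives only in 'remoteRequestsTable' is never sent again unless something explicitly puts it back, which is the hang the description reports.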


        
