hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
Date Thu, 06 Nov 2014 20:54:37 GMT

    [ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200904#comment-14200904 ]

Varun Vasudev commented on YARN-2821:
-------------------------------------

The root cause appears to be an unexpected over-allocation. In this case the app master was
allocated one more container than it expected (the log below shows five asks but allocatedCnt
values of 1, 3 and 2, six containers in total) and went into an infinite loop in the finish
function. With regard to the extra container, it's possible we're seeing a variant of YARN-110.
Unfortunately, the RM doesn't log asks, so we can't tell the sequence of asks that led to the
extra allocation.
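
To make the failure mode concrete, here is a minimal sketch of how an exact-match completion
check can spin forever after an over-allocation. It is not the actual ApplicationMaster code:
the class CompletionTracker and its methods are hypothetical names used for illustration only,
and the idea that completions arrive in batches that can skip past the expected count is an
assumption on my part. The numbers mirror the quoted log: five asks, but allocatedCnt values
of 1, 3 and 2, i.e. six containers.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical completion tracker; NOT the real distributed shell ApplicationMaster. */
class CompletionTracker {
    // Number of containers the app master asked for (5 in the quoted log).
    private final int expected;
    // Running count of containers reported as completed.
    private final AtomicInteger completed = new AtomicInteger();

    CompletionTracker(int expected) {
        this.expected = expected;
    }

    /** Record a batch of completions; assumes one RM response may report several. */
    void onContainersCompleted(int batchSize) {
        completed.addAndGet(batchSize);
    }

    /** Fragile check: never true again once the count skips past the expected value. */
    boolean isDoneExact() {
        return completed.get() == expected;
    }

    /** Tolerant check: still true if an extra container was allocated and completed. */
    boolean isDoneAtLeast() {
        return completed.get() >= expected;
    }
}
```

With five expected containers, four completions followed by a batch of two leaves the count at
six: a wait loop polling isDoneExact() would never exit, while one polling isDoneAtLeast()
would. Whether the app master should tolerate the extra container or release it back to the RM
is a separate design question.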

> Distributed shell app master becomes unresponsive sometimes
> -----------------------------------------------------------
>
>                 Key: YARN-2821
>                 URL: https://issues.apache.org/jira/browse/YARN-2821
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications/distributed-shell
>    Affects Versions: 2.5.1
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>
> We've noticed that once in a while the distributed shell app master becomes unresponsive
> and is eventually killed by the RM. A snippet of the logs:
> {noformat}
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_000001 received 0 previous attempts' running containers on AM registration.
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000002, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000003, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000004, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000005, containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez3:45454
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, completedCnt=1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_000001 got container status for containerID=container_1415123350094_0017_01_000002, state=COMPLETE, exitStatus=0, diagnostics=
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Container completed successfully., containerId=container_1415123350094_0017_01_000002
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=2
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000006, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000007, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000007
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000006
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
