hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
Date Fri, 07 Nov 2014 05:54:34 GMT

     [ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-2821:
--------------------------------
    Attachment: apache-yarn-2821.1.patch

Thanks for the review, Jian! I considered changing the comparison, but that feels like treating
the symptom. I'd like to get it working correctly without changing the comparison, if possible.

Thanks for pointing out the increment in onStartContainerError. I've addressed that and
made some more fixes in the latest patch.
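
For readers following the thread, here is a minimal, hypothetical sketch of the trade-off discussed above. It is not the ApplicationMaster code and not the contents of the attached patch; the class name, method names, and the batch sizes in the completion scenario are assumptions made for illustration. It relies only on what the quoted log shows: five container asks, followed by allocations of 1, 3 and 2 containers (six in total). If all six are launched, a completion counter can step past the requested total, so an exact-equality check never fires. Relaxing the check to `>=` hides that, whereas capping launches at the requested count addresses the surplus directly.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: not the distributed shell ApplicationMaster
// and not the YARN-2821 patch.
class CompletionCheckSketch {
    private final int numTotalContainers;                 // containers the app asked for
    private final AtomicInteger numCompleted = new AtomicInteger();

    CompletionCheckSketch(int numTotalContainers) {
        this.numTotalContainers = numTotalContainers;
    }

    // Called when a batch of container completions is reported.
    void onContainersCompleted(int count) {
        numCompleted.addAndGet(count);
    }

    // Exact-equality check: never true once the counter has passed the total.
    boolean doneByEquality() {
        return numCompleted.get() == numTotalContainers;
    }

    // Relaxed check: tolerates overshoot, but leaves the surplus containers running.
    boolean doneByThreshold() {
        return numCompleted.get() >= numTotalContainers;
    }

    // Root-cause style alternative: launch at most the number still needed;
    // the caller would release the rest of the allocation.
    static int containersToLaunch(int requested, int alreadyLaunched, int newlyAllocated) {
        int remaining = Math.max(0, requested - alreadyLaunched);
        return Math.min(remaining, newlyAllocated);
    }

    public static void main(String[] args) {
        // Five requested, six launched; completions arrive as batches of 4 and 2 (assumed).
        CompletionCheckSketch am = new CompletionCheckSketch(5);
        am.onContainersCompleted(4);
        am.onContainersCompleted(2);
        System.out.println("equality check done?  " + am.doneByEquality());
        System.out.println("threshold check done? " + am.doneByThreshold());

        // Allocation batches taken from the log: 1, 3, 2.
        int launched = 0;
        for (int allocated : new int[] {1, 3, 2}) {
            int toLaunch = containersToLaunch(5, launched, allocated);
            launched += toLaunch;
            System.out.println("allocated=" + allocated + " launch=" + toLaunch
                    + " release=" + (allocated - toLaunch));
        }
    }
}
```

With the cap in place the sketch launches exactly five containers and releases one, so the equality check can stay as it is.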

> Distributed shell app master becomes unresponsive sometimes
> -----------------------------------------------------------
>
>                 Key: YARN-2821
>                 URL: https://issues.apache.org/jira/browse/YARN-2821
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications/distributed-shell
>    Affects Versions: 2.5.1
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: apache-yarn-2821.0.patch, apache-yarn-2821.1.patch
>
>
> We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. A snippet of the logs:
> {noformat}
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_000001 received 0 previous attempts' running containers on AM registration.
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000002, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000003, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000004, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000005, containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez3:45454
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, completedCnt=1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_000001 got container status for containerID=container_1415123350094_0017_01_000002, state=COMPLETE, exitStatus=0, diagnostics=
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Container completed successfully., containerId=container_1415123350094_0017_01_000002
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=2
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000006, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000007, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000007
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000006
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
