hadoop-common-user mailing list archives

From manoj <manojm....@gmail.com>
Subject Map tasks keep Running even after the node is killed on Apache Yarn.
Date Fri, 14 Aug 2015 23:53:35 GMT
Hi,

I'm on Apache Hadoop 2.6.0 YARN and I'm trying to test dynamic addition and
removal of nodes from the cluster.

The test starts a job with 2 nodes and, while the job is progressing, removes
one of the nodes* by killing its DataNode and NodeManager daemons. (Is it OK
to remove a node like this?)

*This node is not running the ResourceManager or the ApplicationMaster, for sure.

After the node is successfully removed (I can confirm this from the
ResourceManager logs, attached below), the test adds it back and waits until
the job completes.
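For reference, the removal/re-addition step of the test boils down to the
following (a sketch assuming a Hadoop 2.6.0 tarball install; the HADOOP_HOME
path is a placeholder, and the commands run on the worker node itself):

```shell
#!/bin/sh
# Sketch of the node removal/re-addition step in the test.
# Assumption: HADOOP_HOME points at the Hadoop 2.6.0 install directory.
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop-2.6.0}

# "Remove" the node by stopping its slave daemons.
remove_node() {
  "$HADOOP_HOME/sbin/hadoop-daemon.sh" stop datanode
  "$HADOOP_HOME/sbin/yarn-daemon.sh" stop nodemanager
}

# Add it back by restarting them; the NodeManager re-registers with the RM.
add_node_back() {
  "$HADOOP_HOME/sbin/hadoop-daemon.sh" start datanode
  "$HADOOP_HOME/sbin/yarn-daemon.sh" start nodemanager
}
```

(I'm aware the documented alternative is decommissioning: list the host in the
files pointed to by `yarn.resourcemanager.nodes.exclude-path` and
`dfs.hosts.exclude`, then run `yarn rmadmin -refreshNodes` and
`hdfs dfsadmin -refreshNodes`. For this test I wanted to simulate an abrupt
node loss, hence the kill.)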

Node Removal Logs:

2015-08-14 11:15:56,902 INFO
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor:
Expired:host172:36158 Timed out after 60 secs
2015-08-14 11:15:56,903 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
Deactivating Node host172:36158 as it is now LOST
2015-08-14 11:15:56,904 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
host172:36158 Node Transitioned from RUNNING to LOST
2015-08-14 11:15:56,905 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1439575616861_0001_01_000006 Container Transitioned from
RUNNING to KILLED
2015-08-14 11:15:56,906 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
Completed container: container_1439575616861_0001_01_000006 in state:
KILLED event:KILL
2015-08-14 11:15:56,906 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp
RESULT=SUCCESS  APPID=application_1439575616861_0001
CONTAINERID=container_1439575616861_0001_01_000006
2015-08-14 11:15:56,906 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Released container container_1439575616861_0001_01_000006 of capacity
<memory:1024, vCores:1> on host host172:36158, which currently has 1
containers, <memory:1024, vCores:1> used and <memory:1024, vCores:7>
available, release resources=true
2015-08-14 11:15:56,906 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
default used=<memory:3584, vCores:3> numContainers=3 user=hadoop
user-resources=<memory:3584, vCores:3>
2015-08-14 11:15:56,906 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId:
container_1439575616861_0001_01_000006, NodeId: host172:36158,
NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>,
Priority: 20, Token: Token { kind: ContainerToken, service:
XX.XX.0.2:36158 }, ] queue=default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>,
usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1,
numContainers=3 cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,906 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
completedContainer queue=root usedCapacity=1.75
absoluteUsedCapacity=1.75 used=<memory:3584, vCores:3>
cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,906 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting completed queue: root.default stats: default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>,
usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1,
numContainers=3
2015-08-14 11:15:56,906 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application attempt appattempt_1439575616861_0001_000001 released
container container_1439575616861_0001_01_000006 on node: host:
host172:36158 #containers=1 available=1024 used=1024 with event: KILL
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1439575616861_0001_01_000005 Container Transitioned from
RUNNING to KILLED
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
Completed container: container_1439575616861_0001_01_000005 in state:
KILLED event:KILL
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp
RESULT=SUCCESS  APPID=application_1439575616861_0001
CONTAINERID=container_1439575616861_0001_01_000005
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Released container container_1439575616861_0001_01_000005 of capacity
<memory:1024, vCores:1> on host host172:36158, which currently has 0
containers, <memory:0, vCores:0> used and <memory:2048, vCores:8>
available, release resources=true
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
default used=<memory:2560, vCores:2> numContainers=2 user=hadoop
user-resources=<memory:2560, vCores:2>
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId:
container_1439575616861_0001_01_000005, NodeId: host172:36158,
NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>,
Priority: 20, Token: Token { kind: ContainerToken, service:
XX.XX.0.2:36158 }, ] queue=default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>,
usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1,
numContainers=2 cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
completedContainer queue=root usedCapacity=1.25
absoluteUsedCapacity=1.25 used=<memory:2560, vCores:2>
cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting completed queue: root.default stats: default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>,
usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1,
numContainers=2
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application attempt appattempt_1439575616861_0001_000001 released
container container_1439575616861_0001_01_000005 on node: host:
host172:36158 #containers=0 available=2048 used=0 with event: KILL
2015-08-14 11:15:56,907 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Removed node host172:36158 clusterResource: <memory:2048, vCores:8>

Node Addition logs:

2015-08-14 11:19:43,529 INFO org.apache.hadoop.yarn.util.RackResolver:
Resolved host172 to /default-rack
2015-08-14 11:19:43,530 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:
NodeManager from node host172(cmPort: 59426 httpPort: 8042) registered
with capability: <memory:2048, vCores:8>, assigned nodeId
host172:59426
2015-08-14 11:19:43,533 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
host172:59426 Node Transitioned from NEW to RUNNING
2015-08-14 11:19:43,535 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Added node host172:59426 clusterResource: <memory:4096, vCores:16>

*Here's the problem:*

The job never completes! According to the logs, the map tasks that were
scheduled on the removed node are still "RUNNING" with a map progress of
100%. These tasks stay in that state forever.

In the ApplicationMaster container logs I see that it continuously tries to
connect to the node's previous address, host172/XX.XX.XX.XX:36158, even
though the node was removed and re-added on a different port,
host172/XX.XX.XX.XX:59426:

......
......
2015-08-14 11:25:21,662 INFO [ContainerLauncher #7]
org.apache.hadoop.ipc.Client: Retrying connect to server:
host172/XX.XX.XX.XX:36158. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
MILLISECONDS)
......
......
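One thing I suspect matters here: the NodeManager registered with a different
port each time (36158, then 59426), so the restarted NM comes back as a
brand-new NodeId as far as the RM is concerned. By default
`yarn.nodemanager.address` uses port 0, i.e. an ephemeral port chosen at
startup. I'm considering pinning it to a fixed port in yarn-site.xml so the
node re-registers with the same address (a sketch; the port 45454 is just an
arbitrary example):

```xml
<!-- yarn-site.xml on the worker node; 45454 is an arbitrary example port -->
<property>
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
```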

Please let me know if you need to see any more logs.

P.S.: The job completes normally, without dynamic addition and removal of
nodes, on the same cluster with the same memory settings.
Thanks,
--Manoj Kumar M
