Hi,

I'm running a Hadoop 2.2.0 cluster and I've hit an issue where one container is never killed or cleaned up. We set mapreduce.task.timeout to 4 hours, and due to a bug in a third-party library the task got stuck and could not make any progress. I expected the NodeManager to kill the container and release whatever resources it was holding, but the container is still there and the job still exists. Log snippets are below.

BTW, all jobs are uberized.
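For reference, this is how the timeout is set in our job configuration (a minimal sketch; the 4-hour value matches the "Timed out after 14400 secs" message in the log below, and the rest of the file is elided):

```xml
<!-- mapred-site.xml (or per-job configuration) -->
<property>
  <!-- Fail a task attempt that reports no progress for 4 hours; value is in milliseconds -->
  <name>mapreduce.task.timeout</name>
  <value>14400000</value>
</property>
```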

AM (task) log:
===============================

2014-04-07 11:43:20,407 INFO [Main Thread] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1396648506811_47889_000001
2014-04-07 11:43:20,514 WARN [Main Thread] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2014-04-07 11:43:20,514 WARN [Main Thread] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2014-04-07 11:43:20,647 INFO [Main Thread] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
2014-04-07 11:43:20,647 INFO [Main Thread] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (org.apache.hadoop.yarn.security.AMRMTokenIdentifier@11f5ecc8)
2014-04-07 11:43:20,695 INFO [Main Thread] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: The specific max attempts: 2 for application: 47889. Attempt num: 1 is last retry: false
2014-04-07 11:43:20,700 INFO [Main Thread] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Using mapred newApiCommitter.

.....


2014-04-07 15:43:53,448 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1396648506811_47889_m_000000_0: AttemptID:attempt_1396648506811_47889_m_000000_0 Timed out after 14400 secs
2014-04-07 15:43:53,450 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1396648506811_47889_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP

....
2014-04-07 23:14:23,340 INFO [communication thread] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:26,340 INFO [communication thread] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:29,340 INFO [communication thread] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:32,340 INFO [communication thread] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:35,340 INFO [communication thread] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:38,341 INFO [communication thread] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:41,341 INFO [communication thread] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:44,341 INFO [communication thread] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1396648506811_47889_m_000000_0

So you can see the task timed out and the attempt transitioned to FAIL_CONTAINER_CLEANUP, yet "Ping from attempt_1396648506811_47889_m_000000_0" keeps showing up for hours afterwards. It looks like the MRAppMaster still thinks the task is alive rather than timed out.

Below is the NM log:
=============================

2014-04-07 11:43:18,883 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1396648506811_47889_01_000001 transitioned from LOCALIZING to LOCALIZED
2014-04-07 11:43:18,894 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1396648506811_47889_01_000001 transitioned from LOCALIZED to RUNNING
2014-04-07 11:43:18,898 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [nice, -n, 0, bash, /scratch/tie/software/hadoop2/hadoop-ic/temp/nm-local-dir/usercache/tie/appcache/application_1396648506811_47889/container_1396648506811_47889_01_000001/default_container_executor.sh]
2014-04-07 11:43:19,134 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 11:43:20,136 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 11:43:20,178 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1396648506811_47889_01_000001
2014-04-07 11:43:20,188 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 4795 for container-id container_1396648506811_47889_01_000001: 52.1 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used

..............................
# timeout happens here
2014-04-07 15:43:52,925 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:53,926 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:54,405 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 4795 for container-id container_1396648506811_47889_01_000001: 153.7 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2014-04-07 15:43:54,927 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:55,928 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:56,929 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:57,415 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 4795 for container-id container_1396648506811_47889_01_000001: 153.7 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2014-04-07 15:43:57,930 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:58,931 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:59,932 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:44:00,426 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 4795 for container-id container_1396648506811_47889_01_000001: 153.7 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2014-04-07 15:44:00,933 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000


I do not see any message about an attempt to kill the container. Is this a known issue? How can we work around it?

Thanks for your help.

--
--Anfernee