hadoop-mapreduce-issues mailing list archives

From "Andrey Klochkov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-5501) RMContainer Allocator does not stop when cluster shutdown is performed in tests
Date Tue, 10 Sep 2013 22:45:52 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated MAPREDUCE-5501:
---------------------------------------

    Attachment: hanging-rmcontainer-allocator.stdout
    
> RMContainer Allocator does not stop when cluster shutdown is performed in tests
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5501
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5501
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: trunk
>            Reporter: Andrey Klochkov
>         Attachments: hanging-rmcontainer-allocator.stdout, hanging-rmcontainer-allocator.syslog
>
>
> After running the MR job client tests, many MRAppMaster processes stay alive. The reason seems to be that the RMContainer Allocator thread ignores InterruptedException and keeps retrying:
> {code}
> 2013-09-09 18:52:07,505 WARN [RMCommunicator Allocator] org.apache.hadoop.util.ThreadUtil: interrupted while sleeping
> java.lang.InterruptedException: sleep interrupted
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:149)
>         at com.sun.proxy.$Proxy29.allocate(Unknown Source)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:154)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:553)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:236)
>         at java.lang.Thread.run(Thread.java:680)
> 2013-09-09 18:52:37,639 INFO [RMCommunicator Allocator] org.apache.hadoop.ipc.Client: Retrying connect to server: dhcpx-197-141.corp.yahoo.com/10.73.197.141:61163. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-09-09 18:52:38,640 INFO [RMCommunicator Allocator] org.apache.hadoop.ipc.Client: Retrying connect to server: dhcpx-197-141.corp.yahoo.com/10.73.197.141:61163. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> {code}
> It takes more than 6 minutes for the processes to die, which causes various issues with tests that use the same DFS dir.
> {code}
> 2013-09-09 22:26:47,179 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Could not contact RM after 360000 milliseconds.
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Could not contact RM after 360000 milliseconds.
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:563)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:236)
>         at java.lang.Thread.run(Thread.java:680)
> {code}
> Will attach a thread dump separately. 
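
The behavior described above can be illustrated with a minimal, self-contained sketch. It is not Hadoop code: the class `InterruptRetrySketch`, its methods, the retry count and the sleep interval are all hypothetical and chosen only for illustration. It assumes the problem is the general pattern visible in the stack trace, namely a retry loop whose sleep swallows InterruptedException, and contrasts it with a loop that treats the interrupt as a stop request.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names exist in Hadoop.
final class InterruptRetrySketch {

    // Sleeps for at least `millis`, swallowing any interrupt on the way.
    static void sleepIgnoringInterrupts(long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        long remaining;
        while ((remaining = deadline - System.nanoTime()) > 0) {
            try {
                Thread.sleep(Math.max(1, remaining / 1_000_000L));
            } catch (InterruptedException e) {
                // The interrupt is dropped here, so the caller never
                // learns that a shutdown was requested.
            }
        }
    }

    // Retry loop that keeps going after an interrupt; returns attempts made.
    static int retryIgnoringInterrupts(int maxRetries, long sleepMillis) {
        int attempts = 0;
        while (attempts < maxRetries) {
            attempts++;
            sleepIgnoringInterrupts(sleepMillis);
        }
        return attempts;
    }

    // Retry loop that stops as soon as the thread is interrupted.
    static int retryHonoringInterrupts(int maxRetries, long sleepMillis) {
        int attempts = 0;
        while (attempts < maxRetries && !Thread.currentThread().isInterrupted()) {
            attempts++;
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag for callers
                break;
            }
        }
        return attempts;
    }

    // Runs one of the loops on a worker thread, interrupts it right away,
    // waits for it to finish, and returns how many attempts it made.
    static int runAndInterrupt(boolean honor) {
        AtomicInteger attempts = new AtomicInteger();
        Thread worker = new Thread(() -> attempts.set(
                honor ? retryHonoringInterrupts(5, 20)
                      : retryIgnoringInterrupts(5, 20)));
        worker.start();
        worker.interrupt();
        try {
            worker.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException("unexpected interrupt in caller", e);
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        System.out.println("attempts when ignoring interrupts: " + runAndInterrupt(false));
        System.out.println("attempts when honoring interrupts: " + runAndInterrupt(true));
    }
}
```

In this sketch the interrupt-ignoring loop always runs through every retry, while the interrupt-honoring loop exits during or before its first sleep. Whether the actual fix belongs in the retry handler, the allocator thread, or the test shutdown path is not something the sketch tries to answer.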

