Date: Thu, 29 Mar 2012 19:22:22 +0000 (UTC)
From: "Thomas Graves (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <1628598140.34335.1333048942382.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <70506224.10611.1332539007710.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-4062) AM Launcher thread can hang forever

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241533#comment-13241533 ]

Thomas Graves commented on MAPREDUCE-4062:
------------------------------------------

This seems to be the same issue that was seen when the AM hung launching containers in MAPREDUCE-3228. I'm investigating using an rpcTimeout when ContainerManagerPBClientImpl creates the proxy. If anyone knows a reason not to use the rpcTimeout, please let me know.

> AM Launcher thread can hang forever
> -----------------------------------
>
>                 Key: MAPREDUCE-4062
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4062
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> We saw an instance where the RM stopped launching Application Masters. We found that the launcher thread was hung because something weird/bad happened to the NM node. Currently there is only 1 launcher thread (jira 4061 to fix that). We need this to not happen. Even once we increase the number of threads to > 1, if that many nodes go bad the RM would be stuck. Note that this was stuck like this for approximately 9 hours.
> Stack trace on hung AM launcher:
> "pool-1-thread-1" prio=10 tid=0x000000004343e800 nid=0x3a4c in Object.wait() [0x000000004fad2000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1076)
>         - locked <0x00002aab05a4f3f0> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:135)
>         at $Proxy76.startContainer(Unknown Source)
>         at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:87)
>         at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:118)
>         at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:265)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
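
For illustration only, here is a minimal sketch of why a bounded wait matters for a launcher thread. It is not the Hadoop rpcTimeout mechanism and does not touch ContainerManagerPBClientImpl; it uses only java.util.concurrent to put a deadline around a blocking call. The class name BoundedLaunch, the method callWithTimeout, and the hungNode/healthyNode stand-ins are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedLaunch {

    /**
     * Runs a blocking call on a helper thread and waits at most timeoutMs for it.
     * Returns true if the call finished in time, false if it was abandoned.
     */
    static boolean callWithTimeout(Callable<Void> call, long timeoutMs) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<Void> result = worker.submit(call);
            try {
                // Bounded wait: unlike an untimed Object.wait(), this returns
                // control to the caller even if the remote side never answers.
                result.get(timeoutMs, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // Interrupt the stuck call so the helper thread can exit.
                result.cancel(true);
                return false;
            }
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a startContainer() call against a bad node.
        Callable<Void> hungNode = () -> { Thread.sleep(60_000); return null; };
        // Hypothetical stand-in for a call against a node that responds at once.
        Callable<Void> healthyNode = () -> null;

        System.out.println("healthy finished: " + callWithTimeout(healthyNode, 500));
        System.out.println("hung finished: " + callWithTimeout(hungNode, 500));
    }
}
```

With a deadline in place, the caller gets a failure it can report or retry instead of sitting in WAITING state indefinitely. A timeout configured on the RPC proxy itself, as proposed in the comment, would achieve the same effect without an extra helper thread.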