Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Reply-To: common-issues@hadoop.apache.org
Date: Wed, 1 Feb 2012 04:46:59 +0000 (UTC)
From: "Uma Maheswara Rao G (Commented) (JIRA)"
To: common-issues@hadoop.apache.org
Message-ID: <1041364960.1484.1328071619105.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <10432440.225641290382393658.JavaMail.jira@thor>
Subject: [jira] [Commented] (HADOOP-7047) RPC client gets stuck

[
https://issues.apache.org/jira/browse/HADOOP-7047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197577#comment-13197577 ]

Uma Maheswara Rao G commented on HADOOP-7047:
---------------------------------------------

I hit a similar situation in one of my clusters. The client got an OOME in the DataStreamer thread and went into processDatanodeError. While creating the datanode proxy connection there, it hung. Here is the dump (also attached).

{code}
"DataStreamer for file /ngcdn/report/file/toptraffic/20120120-102619003-91.log.tmp block blk_1326295273061_564234" daemon prio=10 tid=0xfec4e000 nid=0x38d0 in Object.wait() [0xffff1000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client.call(Client.java:940)
        - locked <0xb0a9d1e0> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:245)
        at $Proxy6.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:389)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:376)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:413)
        at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:282)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3397)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2809)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3024)
        - locked <0xc55ad1e8> (a java.util.LinkedList)
{code}

I am on the 20.2 version. We have already merged the fix that Hairong pointed to here:

{code}
try {
  while (waitForWork()) { // wait here for work - read or close connection
    receiveResponse();
  }
} catch (Throwable t) {
  // This truly is unexpected, since we catch IOException in receiveResponse
  // -- this is only to be really sure that we don't leave a client hanging
  // forever.
  LOG.warn("Unexpected error reading responses on connection " + this, t);
  markClosed(new IOException("Error reading responses", t));
}
{code}

Looking at this, it should mark the connection closed and notify the waiting threads. That did not happen. Somehow this thread exited, as the attached dump shows: only the NameNode IPC Client thread is there, and there is no DataNode IPC Client thread.
Unfortunately, I ran with INFO-level logs and had not enabled console logs. I did not see any OOME from the IPC Client threads in the INFO logs. If this thread had silently exited with some exception, it would have been logged to the console.
The only possibility I see here is that an OOME was thrown again from inside the catch (Throwable) block?

> RPC client gets stuck
> ---------------------
>
>                 Key: HADOOP-7047
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7047
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.22.0
>
>         Attachments: jstack.log, trunkStuckClient.patch
>
>
> One of the dfs clients in our cluster stuck on waiting for a RPC result.
> However the IPC connection thread who is receiving the RPC result died on OOM error:
> INFO >> Exception in thread "IPC Client (47) connection to XX from root" java.lang.OutOfMemoryError: Java heap space
> INFO >>         at java.util.Arrays.copyOfRange(Arrays.java:3209)
> INFO >>         at java.lang.String.<init>(String.java:216)
> INFO >>         at java.lang.StringBuffer.toString(StringBuffer.java:585)
> INFO >>         at java.net.URI.toString(URI.java:1907)
> INFO >>         at java.net.URI.<init>(URI.java:732)
> INFO >>         at org.apache.hadoop.fs.Path.initialize(Path.java:137)
> INFO >>         at org.apache.hadoop.fs.Path.<init>(Path.java:126)
> INFO >>         at org.apache.hadoop.fs.FileStatus.readFields(FileStatus.java:206)
> INFO >>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237)
> INFO >>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:171)
> INFO >>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:219)
> INFO >>         at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
> INFO >>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:531)
> INFO >>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:466)
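
The question raised in the comment (whether a second OOME thrown from inside the catch (Throwable) block could leave the caller waiting) can be illustrated with a small stand-alone sketch. This is not Hadoop code: the class StuckCallSketch, the Call type, the method names, and the finally-based guard are all hypothetical and chosen only for illustration, and the OutOfMemoryErrors are simulated rather than caused by real heap exhaustion. The sketch only shows that when the handler fails before it notifies, the waiter is never woken, whereas a notification placed in a finally block still runs.

```java
import java.io.IOException;

class StuckCallSketch {

    /** Hypothetical stand-in for an RPC call that a caller thread blocks on. */
    static final class Call {
        private boolean done;
        private IOException error;

        synchronized void fail(IOException e) {
            error = e;
            done = true;
            notifyAll(); // wake every thread blocked in awaitDone()
        }

        /** Waits up to timeoutMs; returns true if the call was completed. */
        synchronized boolean awaitDone(long timeoutMs) throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (!done) {
                long left = deadline - System.currentTimeMillis();
                if (left <= 0) {
                    return false; // nobody notified us in time
                }
                wait(left);
            }
            return true;
        }
    }

    /** Simulates handler work (logging, wrapping) that itself runs out of memory. */
    static void failingHandlerWork() {
        throw new OutOfMemoryError("simulated failure inside the handler");
    }

    /** Returns true if the caller was woken after the receiver thread died. */
    static boolean callerWoken(boolean guardWithFinally) {
        Call call = new Call();
        Thread receiver = new Thread(() -> {
            boolean notified = false;
            try {
                throw new OutOfMemoryError("simulated failure while reading a response");
            } catch (Throwable t) {
                failingHandlerWork(); // throws, so the next two lines never run
                call.fail(new IOException("Error reading responses", t));
                notified = true;
            } finally {
                // Assumed guard, not taken from the patch: notify even if the handler failed.
                if (guardWithFinally && !notified) {
                    call.fail(new IOException("connection thread exited unexpectedly"));
                }
            }
        });
        // Swallow the simulated error so the demo does not print a stack trace.
        receiver.setUncaughtExceptionHandler((thread, e) -> { });
        receiver.start();
        try {
            receiver.join();
            return call.awaitDone(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("handler only:  caller woken = " + callerWoken(false));
        System.out.println("finally guard: caller woken = " + callerWoken(true));
    }
}
```

Running main reports false for the handler-only case and true for the guarded case. The sketch uses a short timeout only so that it terminates; a caller that waits without one, as the DataStreamer dump suggests, would stay blocked.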