Date: Mon, 26 Jul 2010 23:58:16 -0400 (EDT)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] Commented: (HBASE-2880) Hung cluster because master is hung because Get inside synchronize on RegionManager never returned

    [ https://issues.apache.org/jira/browse/HBASE-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892634#action_12892634 ]

stack commented on HBASE-2880:
------------------------------

So, looking into closeCurrentConnection(key), it looks like this:

{code}
private void closeCurrentConnection(SelectionKey key) {
  if (key != null) {
    Connection c = (Connection)key.attachment();
    if (c != null) {
      if (LOG.isDebugEnabled())
        LOG.debug(getName() + ": disconnecting client " + c.getHostAddress());
      closeConnection(c);
    }
  }
}
{code}

key.attachment() must be coming back null, because I don't see the above 'disconnecting client' message in the logs...
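To make the null-attachment case concrete, here is a minimal standalone java.nio sketch (not HBase code, just the SelectionKey semantics at play): a key registered without an attachment, or inspected before attach() has run, returns null from key.attachment(), which would silently skip both the debug log and the closeConnection() call above.

{code}
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class AttachmentDemo {
  public static void main(String[] args) throws Exception {
    Selector selector = Selector.open();
    SocketChannel channel = SocketChannel.open();
    channel.configureBlocking(false);

    // Registered with no attachment -- the same state a key would be in
    // if an error fired before the server attached its Connection object.
    SelectionKey key = channel.register(selector, SelectionKey.OP_CONNECT);
    System.out.println(key.attachment()); // prints "null"

    // Only after an explicit attach() does attachment() return non-null;
    // until then, a closeCurrentConnection(key) shaped like the code
    // above does nothing at all.
    key.attach(new Object());
    System.out.println(key.attachment() != null); // prints "true"
  }
}
{code}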
Here's more context from the logs:

{code}
Error, aborting
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:829)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:423)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:323)
2010-07-23 23:25:59,432 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=29, stores=30, storefiles=52, storefileIndexSize=8, memstoreSize=505, compactionQueueSize=6, usedHeap=976, maxHeap=987, blockCacheSize=160252032, blockCacheFree=46841728, blockCacheCount=2396, blockCacheHitRatio=0
2010-07-23 23:25:59,432 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: exiting on OOME
2010-07-23 23:26:04,784 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://sv2borg185:9000/hbase/TestTable/e75047f4cd69996d26eb2d88918e373d/.tmp/4502975447844372826 to hdfs://sv2borg185:9000/hbase/TestTable/e75047f4cd69996d26eb2d88918e373d/info/316784095266859147
2010-07-23 23:26:04,838 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://sv2borg185:9000/hbase/TestTable/e75047f4cd69996d26eb2d88918e373d/info/316784095266859147, entries=85950, sequenceid=1424, memsize=95.7m, filesize=85.1m to TestTable,0047408745,1279952663377.e75047f4cd69996d26eb2d88918e373d.
2010-07-23 23:26:04,838 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Caches flushed, doing commit now (which includes update scanners)
2010-07-23 23:26:04,839 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~97.7m for region TestTable,0047408745,1279952663377.e75047f4cd69996d26eb2d88918e373d. in 29578ms, sequenceid=1424, compaction requested=true
2010-07-23 23:26:04,839 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver60020.cacheFlusher exiting
...
{code}

If I grep for the 'disconnecting client' message I don't find it.
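The trace places the OOME inside readAndProcess, at ByteBuffer.allocate. That source isn't quoted here, but the Hadoop-style IPC read loop it follows is roughly the pattern below -- a hedged reconstruction with illustrative names, not the actual HBase code: a 4-byte length is read off the wire and a heap buffer of that size is allocated, which is exactly the allocation that can blow up when the heap is already nearly full (usedHeap=976 of maxHeap=987 above).

{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Hedged reconstruction of the read loop implied by the stack trace;
// names and structure are illustrative, not the HBase source.
class ConnectionSketch {
  private final ByteBuffer lengthBuffer = ByteBuffer.allocate(4);
  private ByteBuffer dataBuffer; // null until the length header is read

  int readAndProcess(ReadableByteChannel channel) throws IOException {
    if (dataBuffer == null) {
      int n = channel.read(lengthBuffer);
      if (lengthBuffer.remaining() > 0) return n; // header incomplete
      lengthBuffer.flip();
      int dataLength = lengthBuffer.getInt();
      // This is the allocation that threw OutOfMemoryError in the logs:
      // the requested size comes straight off the wire, and under heap
      // pressure even a modest request pushes the server over the edge.
      dataBuffer = ByteBuffer.allocate(dataLength);
    }
    int n = channel.read(dataBuffer);
    if (dataBuffer.remaining() == 0) {
      dataBuffer.flip();
      processRequest(dataBuffer); // hand the complete call to a handler
      dataBuffer = null;
      lengthBuffer.clear();
    }
    return n;
  }

  private void processRequest(ByteBuffer data) { /* dispatch to handlers */ }
}
{code}

The 'exiting on OOME' line right after the trace shows the Listener thread dying; whatever connections it was tracking at that moment never get the closeConnection() treatment.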
> Hung cluster because master is hung because Get inside synchronize on RegionManager never returned
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2880
>                 URL: https://issues.apache.org/jira/browse/HBASE-2880
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.90.0
>
>
> I just ran into this testing the 0.89 RC candidate.
> So, the Master is hung because all its threads are locked out by one thread that is stuck inside a block synchronized on RegionManager (0x00007fe1f94777d0 in the dump below):
> {code}
> "IPC Server handler 9 on 60000" daemon prio=10 tid=0x00007fe1dc00f000 nid=0x409d in Object.wait() [0x00007fe1e9200000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
>         - locked <0x00007fe1f8672818> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:252)
>         at $Proxy1.get(Unknown Source)
>         at org.apache.hadoop.hbase.master.ServerManager.assignSplitDaughter(ServerManager.java:550)
>         at org.apache.hadoop.hbase.master.ServerManager.processSplitRegion(ServerManager.java:525)
>         - locked <0x00007fe1f94777d0> (a org.apache.hadoop.hbase.master.RegionManager)
>         at org.apache.hadoop.hbase.master.ServerManager.processMsgs(ServerManager.java:476)
>         at org.apache.hadoop.hbase.master.ServerManager.processRegionServerAllsWell(ServerManager.java:425)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:335)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:738)
>         at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
> {code}
> The code above never returns because Call#callComplete is never going to be called on the outstanding Get: the target region server OOME'd, and something in the way the OOME was processed means this connection will never be cleaned up or notified.
> We're stuck here.
> I'm trying to figure out why the cleanup is not happening.
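The hang described in the quoted report is the classic unbounded Object.wait() pattern: HBaseClient.call parks on the Call object until a notify arrives, so if the peer dies without the connection-teardown path firing, the waiter sleeps forever while holding the RegionManager lock. Here is a minimal sketch of that protocol and of a timeout-based guard that would avoid the permanent hang -- illustrative names only, not the actual HBaseClient source:

{code}
// Hedged sketch of the wait/notify protocol implied by the stack trace;
// Call, callComplete() and the timeout guard are illustrative.
class CallSketch {
  private Object value;
  private boolean done;

  // Receiver thread: deliver the response and wake the waiter. If the
  // connection dies and nothing ever invokes this (or an error-flavored
  // variant of it), the waiter below blocks forever -- this issue's bug.
  synchronized void callComplete(Object result) {
    this.value = result;
    this.done = true;
    notifyAll();
  }

  // Caller thread: as in the trace, an unbounded wait() here hangs the
  // master's handler thread indefinitely.
  synchronized Object awaitForever() throws InterruptedException {
    while (!done) {
      wait();
    }
    return value;
  }

  // A bounded variant: failing after a deadline lets the caller notice
  // a dead peer and bail out instead of wedging the whole master.
  synchronized Object awaitWithTimeout(long millis)
      throws InterruptedException, java.io.IOException {
    long deadline = System.currentTimeMillis() + millis;
    while (!done) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        throw new java.io.IOException("call timed out; peer may be dead");
      }
      wait(remaining);
    }
    return value;
  }
}
{code}

Equally, a connection-teardown path that walked all outstanding calls and completed each with an error would unstick the waiter without needing a timeout; either guard would keep a single dead region server from wedging the master.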