Message-ID: <19457294.1183229584647.JavaMail.jira@brutus>
Date: Sat, 30 Jun 2007 11:53:04 -0700 (PDT)
From: "stack (JIRA)"
Reply-To: hadoop-dev@lucene.apache.org
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-1523) Hung region servers waiting on write locks
In-Reply-To: <26096400.1182560665882.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

     [ https://issues.apache.org/jira/browse/HADOOP-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HADOOP-1523:
--------------------------

    Attachment: locks-v2.patch

Here is the commit message to go along with this patch:

HADOOP-1523 'Hung region servers waiting on write locks'
On shutdown, region servers and the master were just cancelling leases without
letting the 'lease expired' code run -- the code that cleans up outstanding
locks in the region server. Outstanding read locks were getting in the way of
the region server obtaining the write locks it needs for the shutdown process.
Also cleaned up the messaging around shutdown so it's clean -- no timeout
messages as region servers try to talk to a master that has already shut down
-- even when region servers take their time going down.
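As a rough illustration of the lease-expiry approach introduced below (see the
Leases.java entry in the file list), here is a minimal sketch. Apart from the
method name closeAfterLeasesExpire, the types, fields, and internals are
assumptions for illustration, not the actual patch:

{code}
import java.util.HashMap;
import java.util.Map;

// Sketch only: wait for outstanding leases to expire (running their cleanup
// callbacks) instead of cancelling them at shutdown.
public class LeasesSketch {

  /** Callback run when a lease expires; in the region server this is where
   *  an expired scanner's outstanding read locks get released. */
  public interface LeaseListener {
    void leaseExpired();
  }

  // Assumed representation: one listener per lease holder.
  private final Map<String, LeaseListener> leases =
    new HashMap<String, LeaseListener>();

  /** Called by the lease-monitor thread when a lease times out. */
  void expire(String holder) {
    LeaseListener listener;
    synchronized (leases) {
      listener = leases.remove(holder);
      leases.notifyAll();  // Wake anyone blocked in closeAfterLeasesExpire().
    }
    if (listener != null) {
      listener.leaseExpired();  // Cleanup (e.g. releasing read locks) runs here.
    }
  }

  /** Shut down only after every outstanding lease has expired, so each
   *  lease's cleanup runs rather than being cancelled along with the lease. */
  public void closeAfterLeasesExpire() throws InterruptedException {
    synchronized (leases) {
      while (!leases.isEmpty()) {
        leases.wait();
      }
    }
    // No expired-lease handler still holds a read lock at this point, so
    // shutdown can safely take the write locks it needs.
  }
}
{code}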
M src/contrib/hbase/conf/hbase-default.xml
    Make the region server lease timeout 30 seconds instead of 3 minutes;
    clients retry anyway. This makes it likely that region servers report in
    their shutdown message before their lease expires on the master.
M src/contrib/hbase/src/java/org/apache/hadoop/hbase/Leases.java
    (closeAfterLeasesExpire): Added.
M src/contrib/hbase/src/java/org/apache/hadoop/hbase/HRegionServer.java
    Added comments.
    (stop): Converted from public to default access (the master shuts down
    region servers).
    (run): Use leases.closeAfterLeasesExpire instead of leases.close. Changed
    the log of the main thread's exit from DEBUG to INFO.
M src/contrib/hbase/src/java/org/apache/hadoop/hbase/HMaster.java
    (letRegionServersShutdown): Added a better explanation of the shutdown
    process to the method doc. Changed timeout waits from
    hbase.regionserver.msginterval to threadWakeFrequency.
    (regionServerReport): If closing, we used to respond immediately to the
    region server with a MSG_REGIONSERVER_STOP. This meant we skipped the
    handling of the region server's MSG_REPORT_EXITING sent on shutdown, so
    region servers had no chance to cancel their lease in the master.
    Reordered: moved the sending of MSG_REGIONSERVER_STOP to after the
    handling of MSG_REPORT_EXITING (a sketch of the reordered handling
    follows the stack traces at the end of this message). Also removed the
    cancelling of leases from the handling of MSG_REGIONSERVER_STOP; leases
    now expire normally (or get cancelled when the region server comes in
    with MSG_REPORT_EXITING).
M src/contrib/hbase/src/java/org/apache/hadoop/hbase/HMsg.java
    (MSG_REGIONSERVER_STOP_IN_ARRAY): Added.

> Hung region servers waiting on write locks
> ------------------------------------------
>
>                 Key: HADOOP-1523
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1523
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: stack
>            Assignee: stack
>         Attachments: locks-v2.patch
>
>
> A couple of times this afternoon I've been able to manufacture a hung
> region server, variously stuck trying to obtain write locks either on the
> memcache or on a row lock in HRegion. The lease expiration must not be
> working properly (shutting down all open scanners). Maybe locks should be
> expiring.
> {code}
> "IPC Server handler 2 on 60010" daemon prio=5 tid=0x005167f0 nid=0x189d000 in Object.wait() [0xb1397000..0xb1397d10]
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x0b316ba8> (a java.util.HashMap)
>     at java.lang.Object.wait(Object.java:474)
>     at org.apache.hadoop.hbase.HRegion.obtainRowLock(HRegion.java:1211)
>     - locked <0x0b316ba8> (a java.util.HashMap)
>     at org.apache.hadoop.hbase.HRegion.startUpdate(HRegion.java:1020)
>     at org.apache.hadoop.hbase.HRegionServer.startUpdate(HRegionServer.java:1007)
>     at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:585)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
>
> "IPC Server handler 1 on 60010" daemon prio=5 tid=0x005163f0 nid=0x189cc00 in Object.wait() [0xb1316000..0xb1316d10]
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x0b317148> (a java.lang.Integer)
>     at java.lang.Object.wait(Object.java:474)
>     at org.apache.hadoop.hbase.HLocking.obtainWriteLock(HLocking.java:82)
>     - locked <0x0b317148> (a java.lang.Integer)
>     at org.apache.hadoop.hbase.HMemcache.add(HMemcache.java:153)
>     at org.apache.hadoop.hbase.HRegion.commit(HRegion.java:1144)
>     - locked <0x0b398080> (a org.apache.hadoop.io.Text)
>     at org.apache.hadoop.hbase.HRegionServer.commit(HRegionServer.java:1071)
>     at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:585)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
> {code}
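The second trace above shows the shape of the hang: "IPC Server handler 1" is
parked in HLocking.obtainWriteLock waiting for outstanding read locks to
drain. Below is a minimal sketch of that reader-count locking pattern --
modeled on the class and monitor type visible in the trace, not taken from
the actual HLocking source -- showing why a scanner that never releases its
read lock hangs every writer:

{code}
// Sketch only: a simple reader/writer lock in the style suggested by the
// stack traces. lockers > 0 means that many readers hold the lock;
// lockers == -1 means one writer holds it.
public class HLockingSketch {
  private final Integer mutex = new Integer(0);  // Monitor, as in the trace.
  private int lockers = 0;

  public void obtainReadLock() throws InterruptedException {
    synchronized (mutex) {
      while (lockers < 0) {   // Wait out an active writer.
        mutex.wait();
      }
      lockers++;
    }
  }

  public void releaseReadLock() {
    synchronized (mutex) {
      lockers--;
      mutex.notifyAll();
    }
  }

  public void obtainWriteLock() throws InterruptedException {
    synchronized (mutex) {
      while (lockers != 0) {  // Blocks forever if a read lock is never released.
        mutex.wait();
      }
      lockers = -1;
    }
  }

  public void releaseWriteLock() {
    synchronized (mutex) {
      lockers = 0;
      mutex.notifyAll();
    }
  }
}
{code}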
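And here, as noted in the HMaster.java entry above, is a minimal sketch of
the reordered regionServerReport handling. Only the MSG_REGIONSERVER_STOP,
MSG_REPORT_EXITING, and MSG_REGIONSERVER_STOP_IN_ARRAY names come from the
commit message; the surrounding types and logic are assumptions for
illustration:

{code}
// Sketch only: when the master is closing, handle a region server's
// MSG_REPORT_EXITING *before* replying MSG_REGIONSERVER_STOP. Replying
// first meant the exit report was never processed, so the server's lease
// in the master was never cancelled.
public class ShutdownHandshakeSketch {
  static final int MSG_REGIONSERVER_STOP = 1;
  static final int MSG_REPORT_EXITING = 2;
  static final int[] MSG_REGIONSERVER_STOP_IN_ARRAY = { MSG_REGIONSERVER_STOP };

  volatile boolean closing = true;  // Assumed master shutdown flag.

  int[] regionServerReport(String server, int[] msgs) {
    if (closing) {
      for (int msg : msgs) {
        if (msg == MSG_REPORT_EXITING) {
          cancelLease(server);  // Clean exit: drop the lease right away.
          break;
        }
      }
      // Only now tell the server to stop. If it never reported exiting,
      // its lease is left to expire normally rather than being cancelled.
      return MSG_REGIONSERVER_STOP_IN_ARRAY;
    }
    return new int[0];  // Normal (non-closing) handling elided.
  }

  void cancelLease(String server) {
    // In the real master this cancels the server's lease in Leases.
    System.out.println("lease cancelled for " + server);
  }
}
{code}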