Mailing-List: contact hbase-issues-help@hadoop.apache.org; run by ezmlm
Message-ID: <1586531588.17131270749816587.JavaMail.jira@brutus.apache.org>
Date: Thu, 8 Apr 2010 18:03:36 +0000 (UTC)
From: "Karthik Ranganathan (JIRA)"
To: hbase-issues@hadoop.apache.org
Subject: [jira] Commented: (HBASE-2413) Master does not respect generation stamps, may result in meta getting permanently offlined
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

[ https://issues.apache.org/jira/browse/HBASE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855052#action_12855052 ]

Karthik Ranganathan commented on
HBASE-2413:
--------------------------------------------

<< A more involved soln. would have us run the code that is executed when znode expires but then when znode actually expires, the code will be run again and we'd have to be careful we recognized the difference between the two runs and not knock out a server that was legit. >>

I was thinking of something like this, not sure how easily this would translate to code:

1. When a server restarts and sends a new start code, we process the shutdown of the older instance:

    if (this.serverAddressToServerInfo.containsKey(hostAndPort)) {
      if (newStartCode > currentStartCode) {
        // process shutdown of current incarnation of RS
        // register new incarnation of RS
      }
    }

2. On the znode expire path:

    // znodeExpiredHostAndPort = ...
    // znodeExpiredStartCode = ...
    if (serverAddressToServerInfo.containsKey(znodeExpiredHostAndPort) &&
        znodeExpiredStartCode >= currentStartCode) {
      // process shutdown
    } else {
      // no op - this should already have been handled
    }

> Master does not respect generation stamps, may result in meta getting permanently offlined
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2413
>                 URL: https://issues.apache.org/jira/browse/HBASE-2413
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.20.3
>            Reporter: Karthik Ranganathan
>            Assignee: stack
>         Attachments: newserver.txt
>
>
> This happens if the RS is restarted before the zk node expires. The sequence is as follows:
> 1. RS1 dies - let's say its server string was HOST1:PORT1:TS1
> 2. In a few seconds RS1 is restarted, it comes up as HOST1:PORT1:TS2 (TS2 is more recent than TS1)
> 3. Master gets a start up message from RS1 with the server name as HOST1:PORT1:TS2
> 4. Master adds this as a new RS, tries to red
>    ---- The master does not use the generation stamps to detect that RS1 has already restarted.
>    ---- Also, if RS1 contained meta, master would try to go to HOST1:PORT1:TS1. It would end up talking to HOST1:PORT1:TS2, which spews a bunch of not serving region exceptions.
> 5. zk node expires for HOST1:PORT1:TS1
> 6. Master tries to process shutdown for HOST1:PORT1:TS1 - this probably interferes with HOST1:PORT1:TS2 and ends up somehow removing the reassign meta in the master's queue.
>    ---- Meta never comes online and master continues logging the following exception indefinitely:
> 2010-04-06 11:02:23,988 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose of test1,7094000000,1270220428234, false, reassign: true
> 2010-04-06 11:02:23,988 DEBUG org.apache.hadoop.hbase.master.ProcessRegionClose$1: Exception in RetryableMetaOperation:
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)
>         at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:63)
>         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:494)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:429)
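
For illustration only, the two start-code checks proposed in the comment could be sketched roughly as below. This is a standalone toy, not HBase master code: the class StartCodeTracker, its map of host:port to start code, and the processedShutdowns list are all hypothetical names made up for the example. It also assumes that a server not yet known to the master is simply registered, which the comment does not spell out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the master's server bookkeeping; not HBase code. */
class StartCodeTracker {
    // host:port -> start code of the incarnation the master currently tracks
    private final Map<String, Long> currentStartCodes = new HashMap<>();
    // Incarnations whose shutdown was processed, recorded as "host:port:startCode"
    final List<String> processedShutdowns = new ArrayList<>();

    /** Path 1: a region server reports in with a start code. */
    void onServerStartup(String hostAndPort, long newStartCode) {
        Long current = currentStartCodes.get(hostAndPort);
        if (current != null) {
            if (newStartCode <= current) {
                return; // stale or duplicate startup message: keep what we have
            }
            // A newer incarnation is up, so retire the older one right away
            processShutdown(hostAndPort, current);
        }
        // Assumption: unknown servers are simply registered
        currentStartCodes.put(hostAndPort, newStartCode);
    }

    /** Path 2: the znode for some incarnation expired. */
    void onZnodeExpired(String hostAndPort, long expiredStartCode) {
        Long current = currentStartCodes.get(hostAndPort);
        if (current != null && expiredStartCode >= current) {
            // The expired znode belongs to the incarnation we still track
            processShutdown(hostAndPort, current);
            currentStartCodes.remove(hostAndPort);
        }
        // Otherwise no-op: the older incarnation was already handled on path 1
    }

    /** Returns the tracked start code, or null if the server is not tracked. */
    Long currentStartCode(String hostAndPort) {
        return currentStartCodes.get(hostAndPort);
    }

    private void processShutdown(String hostAndPort, long startCode) {
        processedShutdowns.add(hostAndPort + ":" + startCode);
    }
}
```

With the sequence from the issue description (TS1 registered, TS2 starts up, then the TS1 znode expires), this sketch processes the shutdown of TS1 once on the startup path and treats the later expiry as a no-op, so TS2 is left registered.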