Return-Path:
Delivered-To: apmail-hadoop-hbase-dev-archive@minotaur.apache.org
Received: (qmail 57430 invoked from network); 7 Nov 2009 23:44:57 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
    by minotaur.apache.org with SMTP; 7 Nov 2009 23:44:57 -0000
Received: (qmail 60755 invoked by uid 500); 7 Nov 2009 23:44:56 -0000
Delivered-To: apmail-hadoop-hbase-dev-archive@hadoop.apache.org
Received: (qmail 60704 invoked by uid 500); 7 Nov 2009 23:44:56 -0000
Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: hbase-dev@hadoop.apache.org
Delivered-To: mailing list hbase-dev@hadoop.apache.org
Received: (qmail 60694 invoked by uid 99); 7 Nov 2009 23:44:56 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
    by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 07 Nov 2009 23:44:56 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0
    tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140)
    by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 07 Nov 2009 23:44:53 +0000
Received: from brutus (localhost [127.0.0.1])
    by brutus.apache.org (Postfix) with ESMTP id 69265234C045
    for ; Sat, 7 Nov 2009 15:44:32 -0800 (PST)
Message-ID: <1798705219.1257637472415.JavaMail.jira@brutus>
Date: Sat, 7 Nov 2009 23:44:32 +0000 (UTC)
From: "stack (JIRA)"
To: hbase-dev@hadoop.apache.org
Subject: [jira] Updated: (HBASE-1928) ROOT and META tables stay in transition state (making the system not usable) if the designated regionServer dies before the assignment is complete
In-Reply-To: <1451204577.1256232779377.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
X-Virus-Checked: Checked by ClamAV on apache.org

[
https://issues.apache.org/jira/browse/HBASE-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1928:
-------------------------

       Resolution: Fixed
    Fix Version/s: 0.21.0
           Status: Resolved  (was: Patch Available)

Committed to branch and trunk. Thank you for the patch, Yannis (passes all tests locally).

> ROOT and META tables stay in transition state (making the system not usable) if the designated regionServer dies before the assignment is complete
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1928
>                 URL: https://issues.apache.org/jira/browse/HBASE-1928
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.20.0, 0.20.1
>         Environment: Linux
>            Reporter: Yannis Pavlidis
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: 1928-branch.patch, diff_ProcessServerShutdown.txt, diff_RegionManager.txt, diff_ServerManager.txt, HBASE-1928.patch, master_cache01.txt, region_cache01.txt, region_cache02.txt
>
>
> During a ROOT or META table re-assignment, if the designated regionServer dies before the assignment is complete, the whole cluster becomes unavailable because the ROOT or META tables cannot be accessed (and they never recover, since they are kept in a transition state).
> These are the 4 cases that replicate this issue (this is the easiest way to replicate it; you can imagine the same sequence occurring in any real system).
> Pre condition
> ============
> 1. a cluster of 3 nodes (cache01, cache02, search01).
> 2. start the system (start-hbase)
> 3. cache02 has META, search01 has ROOT, cache01 has regionServer and Master.
> Case 1:
> =======
> 1. kill cache01
> 2. kill cache02
> 3. now search01 has both ROOT and META.
> 4. re-start RegionServers on cache01 and cache02
> 5. Tail the master logs and grep for "Assigning region -ROOT-" and also "Assigning region .META." (two windows make this easier)
> 6. kill search01
> 7. wait to see to which server the ROOT will be assigned (from the tail)
> 8. quickly kill that server
> 9. you should notice that the ROOT region never gets re-assigned (because it is stuck in the regionsInTransitions)
> The termination is handled in ServerManager::removeServerInfo, since the regionServer reports back to the master that it is shutting down.
> Case 2:
> ========
> Repeat Case 1, and in steps 7 and 8 kill the server that has the META region assigned to it. Again the cluster becomes unavailable because the META region stays in the regionsInTransitions.
> The termination is handled in ServerManager::removeServerInfo, since the regionServer reports back to the master that it is shutting down.
> Case 3:
> ========
> Repeat Case 1, and in steps 7 and 8 kill the server with kill -9 instead of kill. This does not give the regionServer the opportunity to report back to the master that it is terminating. The master will realize this because the znode will expire (but it is a different code path from before - it goes to the ProcessServerShutdown).
> Case 4:
> ========
> Repeat Case 3, and in steps 7 and 8 kill the server with kill -9 instead of kill. This does not give the regionServer the opportunity to report back to the master that it is terminating. The master will realize this because the znode will expire (but it is a different code path from before - it goes to the ProcessServerShutdown).
> The solution would be to check, in ServerManager::removeServerInfo and in ProcessServerShutdown::closeMetaRegions, whether the server that has been terminated had been assigned either the ROOT or META table. If it had, make sure those tables are made ready to be re-assigned.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
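
For readers who want to see the shape of the check proposed in the quoted report, here is a minimal sketch. It is not the committed patch and does not use HBase's real classes: CatalogTransitionSketch, its fields, and its method names are all hypothetical, and the region-name matching is an assumption. It only models the idea that both death paths (the shutdown report and the znode expiry) should release a ROOT or META region that was still in transition to the dead server.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the master's bookkeeping; NOT HBase's actual classes.
class CatalogTransitionSketch {
    // region name -> server the region is currently being assigned to
    private final Map<String, String> regionsInTransition = new HashMap<>();
    // catalog regions that must be offered to a live server again
    private final Set<String> awaitingReassignment = new TreeSet<>();

    // Assumption: catalog regions are recognised by their name prefix.
    static boolean isCatalogRegion(String regionName) {
        return regionName.startsWith("-ROOT-") || regionName.startsWith(".META.");
    }

    void startAssignment(String regionName, String serverName) {
        awaitingReassignment.remove(regionName);
        regionsInTransition.put(regionName, serverName);
    }

    void completeAssignment(String regionName) {
        regionsInTransition.remove(regionName);
    }

    // Intended to be called from both code paths described in the report:
    // the clean shutdown report and the znode-expiry path.
    // Returns the catalog regions released by this call.
    Set<String> serverDied(String serverName) {
        Set<String> released = new TreeSet<>();
        Iterator<Map.Entry<String, String>> it = regionsInTransition.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            String region = entry.getKey();
            if (entry.getValue().equals(serverName) && isCatalogRegion(region)) {
                it.remove();                      // no longer stuck in transition
                awaitingReassignment.add(region); // eligible for a new server
                released.add(region);
            }
            // User regions are left to whatever handling already exists.
        }
        return released;
    }

    boolean isInTransition(String regionName) {
        return regionsInTransition.containsKey(regionName);
    }

    boolean awaitsReassignment(String regionName) {
        return awaitingReassignment.contains(regionName);
    }

    public static void main(String[] args) {
        CatalogTransitionSketch master = new CatalogTransitionSketch();
        master.startAssignment("-ROOT-", "cache01");
        // cache01 dies before the assignment completes (Case 1 in the report).
        System.out.println(master.serverDied("cache01"));
        System.out.println(master.awaitsReassignment("-ROOT-"));
    }
}
```

In this model, killing the designated server mid-assignment leaves -ROOT- marked as awaiting reassignment rather than stuck in transition, which is the behaviour the report asks for.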