Date: Sun, 15 Jan 2012 20:59:39 +0000 (UTC)
From: "Eugene Koontz (Updated) (JIRA)"
To: issues@hbase.apache.org
Message-ID: <2082717589.43120.1326661179902.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <230374541.42919.1326651819527.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

[ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz
updated HBASE-5202:
---------------------------------

    Summary: NPE during Master failover in master.AssignmentManager.regionOnline()  (was: NPE in master.AssignmentManager.regionOnline())

> NPE during Master failover in master.AssignmentManager.regionOnline()
> ---------------------------------------------------------------------
>
>                 Key: HBASE-5202
>                 URL: https://issues.apache.org/jira/browse/HBASE-5202
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.6
>            Reporter: Eugene Koontz
>            Assignee: Eugene Koontz
>         Attachments: HBASE-5202.patch, testMasterFailoverWithSlowRS.txt
>
>
> The following NPE can occur during master failover:
> {code}
> 2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown.
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
>         at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
>         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
>         at java.lang.Thread.run(Thread.java:636)
> {code}
> This is caused by regionOnline() being passed a null serverInfo (its second parameter).
> The AssignmentManager's processFailover() method passes a null to regionOnline() because the value it passes, hsi, is set as:
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
> {code}
> and
>
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
> {code}
> getHServerInfo() is defined as:
> {code}
>   public HServerInfo getHServerInfo(final HServerAddress hsa) {
>     synchronized(this.onlineServers) {
>       // TODO: This is primitive.  Do a better search.
>       for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
>         if (e.getValue().getServerAddress().equals(hsa)) {
>           return e.getValue();
>         }
>       }
>     }
>     return null;
>   }
> {code}
> This can return null because the onlineServers map does not yet have an entry whose server address matches the one returned by the catalogTracker's getRootLocation() or getMetaLocation().
> The catalogTracker uses zookeeper to establish the server locations of {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as these servers register with the master. The two can therefore be inconsistent if either of these regionservers is online with respect to zookeeper but has not yet registered with the master (perhaps due to a high-latency network between the master and the regionserver).
> The attached testMasterFailoverWithSlowRS.txt patch can be used to modify TestMasterFailover to cause this NPE.
> The proposed fix (provided along with the above test in a separate attachment) is for the master to use the new verifyMetaTablesAreUp() to wait for both of the servers named by the catalog tracker's getRootLocation() and getMetaLocation() to register with the master before the master can continue with failover.
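
The idea behind the proposed fix, waiting until a server known to zookeeper has also registered with the master before using its server info, can be sketched roughly as below. This is only an illustration and not the contents of HBASE-5202.patch: the class FailoverWaitSketch, the methods getServerInfo() and waitForServer(), and the String-keyed map are all hypothetical stand-ins for the master's onlineServers lookup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; these names are illustrative and are not HBase's API.
class FailoverWaitSketch {
    // Stand-in for the master's onlineServers map: server address -> server info.
    static final Map<String, String> onlineServers = new ConcurrentHashMap<>();

    // Like getHServerInfo(), returns null when the server has not registered yet.
    static String getServerInfo(String address) {
        return onlineServers.get(address);
    }

    // Polls the lookup until the server registers or the timeout elapses.
    // Returns the server info, or null on timeout or interruption, so the
    // caller must still handle null instead of dereferencing it blindly.
    static String waitForServer(String address, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            String info = getServerInfo(address);
            if (info != null) {
                return info;
            }
            if (System.nanoTime() - deadline >= 0) {
                return null;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return null;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a slow region server that registers after 200 ms.
        Thread slowRs = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                return;
            }
            onlineServers.put("rs1:60020", "info-for-rs1");
        });
        slowRs.start();
        String info = waitForServer("rs1:60020", 5000, 50);
        if (info == null) {
            throw new IllegalStateException("server never registered");
        }
        System.out.println("registered: " + info);
        slowRs.join();
    }
}
```

The sketch returns null on timeout rather than blocking forever, which leaves the choice between aborting and retrying to the caller; whether the actual patch bounds its wait is not stated in this message.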