hbase-issues mailing list archives

From "Zhihong Yu (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()
Date Sun, 15 Jan 2012 21:43:40 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186601#comment-13186601 ]

Zhihong Yu commented on HBASE-5202:

I got the following when I tried to apply HBASE-5202.patch:
2 out of 3 hunks FAILED -- saving rejects to file src/main/java/org/apache/hadoop/hbase/master/HMaster.java.rej
patching file src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Hunk #1 succeeded at 608 (offset -90 lines).
patching file src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
Hunk #1 succeeded at 21 with fuzz 1.
Hunk #2 FAILED at 35.
Hunk #3 succeeded at 50 with fuzz 2 (offset -6 lines).
Hunk #4 FAILED at 953.
2 out of 4 hunks FAILED -- saving rejects to file src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java.rej
Can you provide a new patch?

Normally if a patch is accepted by Hadoop QA, we should only need to rerun the tests reported
as failed by Hadoop QA.

Thanks for working over the weekend.
> NPE during Master failover in master.AssignmentManager.regionOnline()
> ---------------------------------------------------------------------
>                 Key: HBASE-5202
>                 URL: https://issues.apache.org/jira/browse/HBASE-5202
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.6
>            Reporter: Eugene Koontz
>            Assignee: Eugene Koontz
>         Attachments: HBASE-5202.patch, testMasterFailoverWithSlowRS.txt
> The following NPE can occur during master failover:
> {code}
> 2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown.
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
>         at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
>         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
>         at java.lang.Thread.run(Thread.java:636)
> {code}
> This is caused by regionOnline() being passed a null serverInfo (its second parameter).

> The AssignmentManager's processFailover() method passes a null to regionOnline() because the value that processFailover() passes, hsi, is set as:
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
> {code}
> and
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
> {code}
> getHServerInfo() is defined as:
> {code}
>   public HServerInfo getHServerInfo(final HServerAddress hsa) {
>     synchronized(this.onlineServers) {
>       // TODO: This is primitive.  Do a better search.
>       for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
>         if (e.getValue().getServerAddress().equals(hsa)) {
>           return e.getValue();
>         }
>       }
>     }
>     return null;
>   }
> {code}
> This will return null if the onlineServers map does not yet contain an entry whose server address matches the one supplied by the catalogTracker's getRootLocation() or getMetaLocation().
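
The lookup behaviour described above can be illustrated with a minimal, self-contained sketch. This is not HBase code: the class OnlineServerLookup, its methods, and the example address are hypothetical stand-ins, with plain strings in place of HServerInfo and HServerAddress. It only shows that a linear search over registered servers yields null until the server has registered.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the master's online-server lookup; not HBase code.
public class OnlineServerLookup {
    // Maps a server name to the address that server registered with.
    private final Map<String, String> onlineServers = new HashMap<>();

    // Simulates a regionserver registering with the master.
    public void register(String serverName, String address) {
        synchronized (onlineServers) {
            onlineServers.put(serverName, address);
        }
    }

    // Same shape as the linear search in getHServerInfo():
    // returns null when no registered server has the given address.
    public String findByAddress(String address) {
        synchronized (onlineServers) {
            for (Map.Entry<String, String> e : onlineServers.entrySet()) {
                if (e.getValue().equals(address)) {
                    return e.getKey();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        OnlineServerLookup lookup = new OnlineServerLookup();
        // Assumed address, standing in for what the catalog tracker reads from ZooKeeper.
        String metaLocation = "rs1.example.com:60020";

        // The server is known via ZooKeeper but has not registered yet.
        String before = lookup.findByAddress(metaLocation);
        System.out.println("before registration: " + before);

        lookup.register("rs1", metaLocation);
        String after = lookup.findByAddress(metaLocation);
        System.out.println("after registration: " + after);
    }
}
```

Any caller that dereferences the result without a null check during the window before registration would hit the same kind of NullPointerException as in the stack trace above.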
> Since the catalogTracker uses ZooKeeper to establish the server locations of {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as these servers register with the master, there can be an inconsistency between the catalogTracker and the onlineServers map if either of these regionservers is online with respect to ZooKeeper but hasn't yet registered with the master (perhaps due to a high-latency network between the master and the regionserver).
> The attached testMasterFailoverWithSlowRS.txt patch can be used to modify TestMasterFailover to cause this NPE.
> The proposed fix (provided along with the above test in a separate attachment) is for the master to use the new verifyMetaTablesAreUp() to wait for both of the servers named by the catalog tracker's getRootLocation() and getMetaLocation() to register with the master before the master can continue with failover.
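
The general idea of waiting for registration before continuing can be sketched as below. This is only an illustration under stated assumptions, not the contents of HBASE-5202.patch: the class RegistrationWaiter, its method names, the polling approach, and the timeout are all hypothetical, and the real verifyMetaTablesAreUp() may work differently.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from the HBASE-5202 patch.
public class RegistrationWaiter {
    // Addresses of servers that have registered with the (simulated) master.
    private final Set<String> registered = ConcurrentHashMap.newKeySet();

    // Simulates a regionserver registering with the master.
    public void register(String address) {
        registered.add(address);
    }

    // Polls until every given address has registered, or the timeout elapses.
    // Returns true if all registered in time; false on timeout or interruption.
    public boolean waitForRegistration(long timeoutMs, long pollMs, String... addresses) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            boolean allPresent = true;
            for (String address : addresses) {
                if (!registered.contains(address)) {
                    allPresent = false;
                    break;
                }
            }
            if (allPresent) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        RegistrationWaiter waiter = new RegistrationWaiter();
        // Assumed addresses, standing in for the -ROOT- and .META. server locations.
        String root = "rs1.example.com:60020";
        String meta = "rs2.example.com:60020";

        // Register both servers from another thread after a short delay.
        Thread slowServers = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            waiter.register(root);
            waiter.register(meta);
        });
        slowServers.start();

        boolean ready = waiter.waitForRegistration(5000, 20, root, meta);
        System.out.println("both servers registered: " + ready);
    }
}
```

A bounded wait like this leaves the caller to decide what to do on timeout, for example aborting failover with a clear error instead of proceeding with a null server reference.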
