hbase-issues mailing list archives

From "Kannan Muthukkaruppan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3141) Master RPC server needs to be started before an RS can check in
Date Fri, 22 Oct 2010 04:42:15 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923754#action_12923754 ]

Kannan Muthukkaruppan commented on HBASE-3141:
----------------------------------------------

We ran into this today during a shutdown/startup:

In 0.89, things happen in this order in the master code:

{code}
In the constructor:
(i) this.rpcServer = HBaseRPC.getServer(this, a.getBindAddress()... )   // instantiate the server..
(ii) Try to become "primary" master, by writing to zookeeper.
In the run loop:
(iii) startServiceThreads() --> this.rpcServer.start()
{code}

Step (ii) blocked indefinitely because a different master became the primary. At startup, some
Region Servers incorrectly tried to report in to this master, apparently because the /hbase/master
ZK node from the previous shutdown hadn't quite expired (?) and still held this master's info.
Since this master had instantiated its rpcServer but never reached step (iii), those calls were
never answered.

What if we simply moved (iii) ahead of (ii), i.e. start the rpcServer in the constructor itself,
before blocking on ZK's /hbase/master node?
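
To make the ordering question concrete, here is a minimal toy model of it. Nothing in it is an
HBase API: ToyRpcServer, ToyMaster and probe() are hypothetical names, a latch stands in for
blocking on ZK's /hbase/master node, and the caller is given a timeout only so that the demo
terminates. It is a sketch of the idea under those assumptions, not a patch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Toy model of the master startup ordering; none of these classes are HBase APIs. */
class StartupOrderSketch {

    /** Stand-in for the RPC server: it can only answer once start() has run. */
    static class ToyRpcServer {
        private final CountDownLatch started = new CountDownLatch(1);

        void start() {
            started.countDown();
        }

        /** Returns a protocol version, or null if nothing answered within timeoutMs. */
        Long getProtocolVersion(long timeoutMs) {
            try {
                if (!started.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                    return null; // instantiated but never started: the caller gets no reply
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
            return 1L; // arbitrary version number, for illustration only
        }
    }

    /** Stand-in for the master; the latch models blocking on the /hbase/master node. */
    static class ToyMaster {
        final ToyRpcServer rpcServer = new ToyRpcServer();        // step (i)
        final CountDownLatch becamePrimary = new CountDownLatch(1);

        void launch(boolean startRpcBeforeElection) {
            Thread t = new Thread(() -> {
                try {
                    if (startRpcBeforeElection) {
                        rpcServer.start();                        // (iii) moved ahead of (ii)
                    }
                    becamePrimary.await();                        // step (ii), may block forever
                    if (!startRpcBeforeElection) {
                        rpcServer.start();                        // step (iii), 0.89 order
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.setDaemon(true); // do not keep the JVM alive for a master stuck in step (ii)
            t.start();
        }
    }

    /** What a caller sees from a master that never becomes primary. */
    static Long probe(boolean startRpcBeforeElection) {
        ToyMaster master = new ToyMaster();
        master.launch(startRpcBeforeElection);
        return master.rpcServer.getProtocolVersion(500);
    }

    public static void main(String[] args) {
        System.out.println("0.89 order:      " + probe(false));
        System.out.println("reordered start: " + probe(true));
    }
}
```

In this model the first probe gets no reply (null) and the second gets a version back even though
the master never becomes primary. What a non-primary master should then say to a region server
that reports for duty is a separate question, which the sketch does not try to answer.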

Todd's fix seems more elaborate -- is that extra state of "accepting calls" really necessary?

Hairong has also suggested that we add timeouts on the HBaseRPC.getProxy() calls. See the stack
trace below, where the RS was stuck indefinitely waiting on the above master.

{code}

"regionserver60020" prio=10 tid=0x00002aaeb4e5d000 nid=0x1cae in Object.wait() [0x000000004264e000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaab7560fa8> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
        - locked <0x00002aaab7560fa8> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:252)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:408)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:384)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:431)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:342)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1210)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1227)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:432)
        at java.lang.Thread.run(Thread.java:619)

{code}
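
As a rough illustration of the timeout idea, the sketch below bounds a blocking call using
Future.get() with a timeout from java.util.concurrent. It is not how HBaseRPC.getProxy() works
and does not use any HBase class: callWithTimeout(), probe() and the -1 fallback value are
hypothetical, and the "hung master" is just a callable that waits forever, the way the
getProtocolVersion() call does in the trace above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy illustration of bounding a blocking call; not an HBase API. */
class ProxyTimeoutSketch {

    /** Runs a blocking call on a helper thread; returns fallback if it exceeds timeoutMs. */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "rpc-call");
            t.setDaemon(true); // a stuck call must not keep the JVM alive
            return t;
        });
        Future<T> pending = pool.submit(call);
        try {
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupt the stuck call
            return fallback;      // the caller can retry or look up the master again
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            pool.shutdownNow();
        }
    }

    /** Simulates the version handshake against a master that hangs or answers. */
    static long probe(boolean masterHangs) {
        Callable<Long> handshake = () -> {
            if (masterHangs) {
                new CountDownLatch(1).await(); // never released: models the stuck wait()
            }
            return 1L; // arbitrary version number, for illustration only
        };
        return callWithTimeout(handshake, 200, -1L);
    }

    public static void main(String[] args) {
        System.out.println("hung master: " + probe(true));
        System.out.println("live master: " + probe(false));
    }
}
```

With a bound like this, a region server pointed at a stale master would get control back after
the timeout instead of waiting forever, and could then re-read the master address and retry.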




> Master RPC server needs to be started before an RS can check in
> ---------------------------------------------------------------
>
>                 Key: HBASE-3141
>                 URL: https://issues.apache.org/jira/browse/HBASE-3141
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>            Reporter: Jonathan Gray
>            Priority: Critical
>             Fix For: 0.90.0
>
>
> Starting up an RPC server is done in two steps.  In the constructor, we instantiate the
> RPC server.  Then in startServiceThreads() we start() it.
> If someone RPCs in between the instantiation and the start(), it seems that bad things
> can happen.  We need to make sure this can't happen and there aren't any races here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

