hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5916) RS restart just before master intialization we make the cluster non operative
Date Fri, 25 May 2012 12:48:24 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283349#comment-13283349
] 

ramkrishna.s.vasudevan commented on HBASE-5916:
-----------------------------------------------

@Chunhui
I like your idea too.  As I said, we are planning to raise an improvement task for master
restart and SSH.
Even with the above approach, there is one more scenario that is problematic.
Please note that this scenario can occur even without your suggestion.

Suppose there are two region servers, and both go down while the flow is in AM.joinCluster(). As
no RS is available at that time, we will not make any assignment, and all regions will go into RIT
waiting for the timeout monitor. SSH is also waiting because master initialization is not complete (this
step is as per your suggestion).  Now suppose there are 100 regions, all waiting to be
assigned.
Now if a new RS comes up, there is this code in TimeoutMonitor:
{code}
if (regionState.getStamp() + timeout <= now) {
  // decide on action upon timeout
  actOnTimeOut(regionState);
} else if (this.allRegionServersOffline && !allRSsOffline) {
  // if some RSs just came back online, we can start
  // the assignment right away
  actOnTimeOut(regionState);
}
{code}
This will immediately trigger assignment.  At the same time, since master initialization has by then
completed, SSH is also able to carry out assignment.  This will lead to double
assignment.  In fact, in defect HBASE-5816 Stack suggested having one common queue through which
every assignment is done, so that SSH and the other assignment paths do not interfere with each other.
I suggest we get in the patch that addresses the current JIRA's problem and work on a different
JIRA that will help me address the master restart and SSH area, which is troublesome.
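
To make the common-queue idea concrete, here is a minimal sketch of how a single queue could reject a second request for a region that is already pending. This is only an illustration under my own assumptions: the class SingleAssignmentQueue and its methods submit, poll and done are hypothetical names, not existing HBase code and not what Stack proposed in detail.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: none of these names exist in HBase.
class SingleAssignmentQueue {
    // Regions that are queued or currently being assigned.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    // The one queue that every caller (TimeoutMonitor, SSH, ...) must go through.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    /** Returns true if the request was accepted, false if the region is already pending. */
    boolean submit(String regionName) {
        if (!pending.add(regionName)) {
            return false; // another caller already requested this region
        }
        queue.add(regionName);
        return true;
    }

    /** Returns the next region to assign, or null if the queue is empty. */
    String poll() {
        return queue.poll();
    }

    /** Marks an assignment attempt as finished so the region may be requested again. */
    void done(String regionName) {
        pending.remove(regionName);
    }
}
```

With something like this, if TimeoutMonitor and SSH both call submit for the same region, only the first call is accepted, and the second caller learns from the return value that the assignment is already in progress.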

                
> RS restart just before master intialization we make the cluster non operative
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-5916
>                 URL: https://issues.apache.org/jira/browse/HBASE-5916
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.94.1
>
>         Attachments: HBASE-5916_trunk.patch, HBASE-5916_trunk_1.patch, HBASE-5916_trunk_1.patch,
HBASE-5916_trunk_2.patch, HBASE-5916_trunk_3.patch, HBASE-5916_trunk_4.patch, HBASE-5916_trunk_v5.patch,
HBASE-5916_trunk_v6.patch, HBASE-5916_trunk_v7.patch, HBASE-5916v8.patch
>
>
> Consider a case where my master is being restarted.  An RS that was alive when the master restart began is itself restarted before the master enables the ServerShutdownHandler.
> {code}
> serverShutdownHandlerEnabled = true;
> {code}
> In this case, when the RS tries to register with the master, the master will try to expire the old server instance, but the server cannot be expired because the ServerShutdownHandler is not yet enabled.
> This can happen when only one RS is restarted or when all the RSs are restarted at the same time (before assignRootAndMeta).
> {code}
> LOG.info(message);
> if (existingServer.getStartcode() < serverName.getStartcode()) {
>   LOG.info("Triggering server recovery; existingServer " +
>     existingServer + " looks stale, new server:" + serverName);
>   expireServer(existingServer);
> }
> {code}
> If another RS is brought up, the cluster returns to normal.
> This may be a rare corner case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
