hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5916) RS restart just before master intialization we make the cluster non operative
Date Fri, 25 May 2012 04:26:23 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283110#comment-13283110 ]

ramkrishna.s.vasudevan commented on HBASE-5916:

First of all, thanks for taking the time to prepare a patch.
I think there is one problem if we don't fetch the new online servers in joinCluster:
  STEP 1:    this.serverManager.expireDeadNotExpiredServers();

    // Update meta with new HRI if required. i.e migrate all HRI with HTD to
    // HRI with out HTD in meta and update the status in ROOT. This must happen
    // before we assign all user regions or else the assignment will fail.
    // TODO: Remove this when we do 0.94.
  STEP 2:    org.apache.hadoop.hbase.catalog.MetaMigrationRemovingHTD.

    // Fixup assignment manager status
    status.setStatus("Starting assignment manager");
  STEP 3:    this.assignmentManager.joinCluster(onlineServers);
Here is one scenario. It may be very rare, but it is still possible:
I have 3 RSs at STEP 1.
One of them goes down, and the ServerShutdownHandler (SSH) processes it and tries to assign its regions.
Before that assignment is done, a new RS comes up before STEP 3.
There is a small chance that the regions from the dead RS are assigned to this new RS. In
STEP 3, because we have already obtained the list of online servers, we may end up treating the new RS
as an offline server after scanning META. Please correct me if I am wrong. It is a corner case.
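
The scenario above can be sketched as a stale-snapshot problem. This is only an illustration under simplifying assumptions, not HBase code: the class, the method `regionsOnUnknownServers`, and the server and region names are all hypothetical, and META is modelled as a plain map from region to hosting server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the race; none of these names come from HBase.
class StaleSnapshotSketch {

    // Returns the regions whose recorded location is a server that is
    // missing from the supplied "online" set.
    static List<String> regionsOnUnknownServers(Map<String, String> metaLocations,
                                                Set<String> knownOnline) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : metaLocations.entrySet()) {
            if (!knownOnline.contains(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three region servers are live at STEP 1.
        Set<String> liveServers = new HashSet<>(Arrays.asList("rs1", "rs2", "rs3"));
        // The online-server list is captured early and handed to STEP 3 later.
        Set<String> snapshot = new HashSet<>(liveServers);

        Map<String, String> meta = new HashMap<>();
        meta.put("regionA", "rs1");
        meta.put("regionB", "rs2");
        meta.put("regionC", "rs3");

        // Between STEP 1 and STEP 3: rs3 dies, rs4 joins, and the shutdown
        // handling reassigns regionC to the newcomer.
        liveServers.remove("rs3");
        liveServers.add("rs4");
        meta.put("regionC", "rs4");

        // With the stale snapshot, rs4 looks offline, so regionC is flagged.
        System.out.println("stale snapshot: " + regionsOnUnknownServers(meta, snapshot));
        // With a fresh list, nothing is flagged.
        System.out.println("fresh list:     " + regionsOnUnknownServers(meta, liveServers));
    }
}
```

In this sketch the only difference between the two calls is when the server list was read, which is the point of the concern: a list captured before STEP 1 cannot know about a server that registered afterwards.
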
> RS restart just before master intialization we make the cluster non operative
> -----------------------------------------------------------------------------
>                 Key: HBASE-5916
>                 URL: https://issues.apache.org/jira/browse/HBASE-5916
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.94.1
>         Attachments: HBASE-5916_trunk.patch, HBASE-5916_trunk_1.patch, HBASE-5916_trunk_1.patch, HBASE-5916_trunk_2.patch, HBASE-5916_trunk_3.patch, HBASE-5916_trunk_4.patch, HBASE-5916_trunk_v5.patch, HBASE-5916_trunk_v6.patch, HBASE-5916_trunk_v7.patch, HBASE-5916v8.patch
> Consider a case where my master is being restarted. An RS that was alive when the master
> restart began is itself restarted before the master initializes the ServerShutdownHandler.
> {code}
> serverShutdownHandlerEnabled = true;
> {code}
> In this case, when the RS tries to register with the master, the master tries to expire
> the server, but the server cannot be expired because the serverShutdownHandler is not yet enabled.
> This can happen when only one RS is restarted or when all the RSs are restarted
> at the same time (before assignRootandMeta).
> {code}
> LOG.info(message);
> if (existingServer.getStartcode() < serverName.getStartcode()) {
>   LOG.info("Triggering server recovery; existingServer " +
>     existingServer + " looks stale, new server:" + serverName);
>   expireServer(existingServer);
> }
> {code}
> If another RS is brought up, the cluster returns to normal.
> This may be a rare corner case.
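
One way to picture the failure described above, and the idea of expiring such servers later, is the following minimal sketch. It is an assumption-laden model rather than the ServerManager implementation: the class `DeferredExpirySketch`, its fields, and its methods are hypothetical, and only the start-code comparison and the notion of an "enabled" shutdown handler are taken from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: expiry requests that arrive before the shutdown
// handler is enabled are remembered instead of being silently dropped.
class DeferredExpirySketch {
    private boolean shutdownHandlerEnabled = false;
    private final Map<String, Long> onlineStartCodes = new HashMap<>();
    private final Set<String> deadNotExpired = new LinkedHashSet<>();
    private final List<String> expired = new ArrayList<>();

    // A server registers with host:port and a start code. An existing entry
    // with a smaller start code is treated as stale, mirroring the quoted check.
    void register(String hostAndPort, long startCode) {
        Long existing = onlineStartCodes.get(hostAndPort);
        if (existing != null && existing < startCode) {
            expire(hostAndPort + "," + existing);
        }
        onlineStartCodes.put(hostAndPort, startCode);
    }

    private void expire(String serverName) {
        if (!shutdownHandlerEnabled) {
            // Assumption: queue the server so it can be expired later.
            deadNotExpired.add(serverName);
            return;
        }
        expired.add(serverName);
    }

    // Enables the handler and processes everything that was queued earlier.
    void enableShutdownHandlerAndDrain() {
        shutdownHandlerEnabled = true;
        for (String serverName : new ArrayList<>(deadNotExpired)) {
            expire(serverName);
        }
        deadNotExpired.clear();
    }

    List<String> expiredServers() {
        return expired;
    }

    Set<String> pendingServers() {
        return deadNotExpired;
    }

    public static void main(String[] args) {
        DeferredExpirySketch master = new DeferredExpirySketch();
        master.register("host1:60020", 100L);
        master.register("host1:60020", 200L); // restart before the handler is enabled
        System.out.println("pending before enable: " + master.pendingServers());
        master.enableShutdownHandlerAndDrain();
        System.out.println("expired after enable:  " + master.expiredServers());
    }
}
```

Without the pending set, the stale entry in this model would never be expired, which corresponds to the stuck state reported in the issue.
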

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

