hbase-issues mailing list archives

From "jiraposter@reviews.apache.org (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4015) Refactor the TimeoutMonitor to make it less racy
Date Wed, 07 Sep 2011 15:54:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099048#comment-13099048 ]

jiraposter@reviews.apache.org commented on HBASE-4015:

(Updated 2011-09-07 15:52:26.864199)

Review request for Ted Yu, Michael Stack, Jean-Daniel Cryans, and Jonathan Gray.


Updated the patch as per the review comments.
Changed hijackAndPreempt to hijack.


HBASE-4015 - updated patch.


Instead of storing the state at the moment the timeout was detected, we
process it using a future task.
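
One possible reading of the "future task" approach, as a minimal sketch: the task looks up the region's state when it actually runs instead of acting on a snapshot taken when the timeout was noticed. The class, the map, and the "REASSIGN"/"SKIP" results are hypothetical names for illustration only; they are not taken from the patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class TimeoutTaskSketch {
  // Hypothetical stand-in for the master's regions-in-transition map
  // (encoded region name -> current state name).
  static final ConcurrentHashMap<String, String> REGIONS_IN_TRANSITION =
      new ConcurrentHashMap<>();

  // The task reads the state when it runs, not when the timeout was noticed.
  static Callable<String> timeoutTask(final String encodedName) {
    return () -> {
      String current = REGIONS_IN_TRANSITION.get(encodedName);
      boolean stuck = "PENDING_OPEN".equals(current) || "OPENING".equals(current);
      // Assumption: only regions still stuck in an opening state are retried.
      return stuck ? "REASSIGN" : "SKIP";
    };
  }

  // Submits the task and waits for its decision.
  static String runTimeoutTask(String encodedName) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<String> result = pool.submit(timeoutTask(encodedName));
      return result.get();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```

If the region leaves the map (for example because it opened) before the task runs, the task simply skips it.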

        // TODO: Could check if it was on deadServers.  If it was, then we could
        // do what happens in TimeoutMonitor when it sees this condition.

        // Just insert region into RIT
        // If this never updates the timeout will trigger new assignment
        if (regionInfo.isMetaRegion() || regionInfo.isRootRegion()) {
          regionsInTransition.put(encodedRegionName, new RegionState(
              regionInfo, RegionState.State.OPENING, data.getStamp(), data
        regionsInTransition.put(encodedRegionName, new RegionState(regionInfo,
            RegionState.State.OPENING, data.getStamp(), data.getOrigin()));


This change is for HBASE-4203. The META and ROOT tables need not wait until the timeout.

In forceRegionStateToOffline()

    } else {
      // If invoked from timeout monitor donot force it to OFFLINE. Based on the
      // state we will decide if to change in-memory state to OFFLINE or not.  It will
      // be done before setting the znode to OFFLINE state.
      if (!timeOutMonitorReAllocate) {
        LOG.debug("Forcing OFFLINE; was=" + state);

If the timeout monitor tries to reallocate the node, do not force
the in-memory state to OFFLINE.
The normal assign flow, however, still expects the in-memory state to be
OFFLINE (or CLOSED), hence the above change.  This continues with the check in
int setOfflineInZooKeeper(final RegionState state,
      boolean timeOutMonitorReAllocate) {
    // If invoked from timeoutmonitor the current state in memory need not be
    // OFFLINE.  
    if (!timeOutMonitorReAllocate && !state.isClosed() && !state.isOffline())
          this.master.abort("Unexpected state trying to OFFLINE; " + state,
          new IllegalStateException());
      return -1;

    boolean allowCreation = false;
    // If the isReAllocate is true and the current state is PENDING_OPEN
    // or OPENING then update the inmemory state to PENDING_OPEN. This is
    // important because
    // if timeoutmonitor deducts that a region was in OPENING state for a long
    // time but by the
    // time timeout monitor tranits the node to OFFLINE the RS would have opened
    // the node and the
    // state in znode will be RS_ZK_REGION_OPENED. Inorder to invoke the
    // OpenedRegionHandler
    // we expect the inmemeory state to be PENDING_OPEN or OPENING.
    // For all other cases we can change the inmemory state to OFFLINE.
    if (timeOutMonitorReAllocate
        && (state.getState().equals(RegionState.State.PENDING_OPEN) || state
            .getState().equals(RegionState.State.OPENING))) {
      allowCreation = false;
    } else {
      allowCreation = true;

This change is quite tricky.
In the normal assign flow the unassigned node for the region will not be present,
so we need to allow the node to be newly created.
In the timeout monitor case the node will already be present in some state, so
we decide whether to create the node inside ZKAssign.createOrForceNodeOffline.

The above code also updates the in-memory state to OFFLINE or PENDING_OPEN,
which was not updated in the previous forceRegionStateToOffline() call.
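
The decision described above can be restated as a small standalone sketch. The enum and method names below are hypothetical and only mirror the rule in the quoted comment; this is not the code from the patch.

```java
final class OfflineDecisionSketch {
  // Simplified stand-in for RegionState.State.
  enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN, PENDING_CLOSE, CLOSING, CLOSED }

  // Creation of the znode is allowed unless the timeout monitor is
  // reallocating a region that is still PENDING_OPEN or OPENING.
  static boolean allowCreation(boolean timeOutMonitorReAllocate, State current) {
    boolean opening = current == State.PENDING_OPEN || current == State.OPENING;
    return !(timeOutMonitorReAllocate && opening);
  }

  // In-memory state after the decision: PENDING_OPEN when creation is not
  // allowed (so an OPENED event can still be handled), otherwise OFFLINE.
  static State nextInMemoryState(boolean timeOutMonitorReAllocate, State current) {
    return allowCreation(timeOutMonitorReAllocate, current)
        ? State.OFFLINE
        : State.PENDING_OPEN;
  }
}
```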

In ZKAssign.java:
    if (version == -1) {
      // If timeoutmonitor deducts a node to be in OPENING state but before it
      // could
      // transit to OFFLINE state if RS had opened the region then the Master
      // deletes the
      // assigned region znode. In that case the znode will not exist. So we
      // should not
      // create the znode again which will lead to double assignment.
      if (timeOutMonitorReAllocate && !allowCreation) {
        return -1;

This part prevents double assignment.
Suppose the timeout monitor tries to force an existing region in RIT to the OFFLINE state,
but the region is opened before it can do so. The OpenedRegionHandler will then delete
the node, so we should not create the node again.
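
A minimal model of the missing-znode branch, assuming the method returns the version of the OFFLINE znode or -1 on refusal. The method name and the returned versions (0 for a freshly created node, old version plus one after an update) are illustrative assumptions, not the patch's code.

```java
final class MissingNodeSketch {
  // existingVersion == -1 models "the znode does not exist".
  static int offlineVersionFor(int existingVersion,
                               boolean timeOutMonitorReAllocate,
                               boolean allowCreation) {
    if (existingVersion == -1) {
      if (timeOutMonitorReAllocate && !allowCreation) {
        // The region was opened and its znode deleted in the meantime;
        // recreating the node here would cause a double assignment.
        return -1;
      }
      return 0; // a freshly created znode starts at version 0
    }
    return existingVersion + 1; // updating an existing znode bumps its version
  }
}
```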

In ZKAssign.java:
    } else {
      RegionTransitionData curDataInZNode = ZKAssign.getDataNoWatch(zkw, region
          .getEncodedName(), stat);
      // Do not move the node to OFFLINE if znode is in any of the following
      // state.
      // Because these are already executed states.
      if (timeOutMonitorReAllocate && null != curDataInZNode) {
        EventType eventType = curDataInZNode.getEventType();
        if (eventType.equals(EventType.RS_ZK_REGION_CLOSING)
            || eventType.equals(EventType.RS_ZK_REGION_CLOSED)
            || eventType.equals(EventType.RS_ZK_REGION_OPENED)) {
          return -1;

This check prevents moving the node to the OFFLINE state if, just before
the attempt to force it OFFLINE, the RS has changed the state
to CLOSING, CLOSED, or OPENED.  Moving it to OFFLINE again at that point
would lead to double assignment and an additional operation.
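
The same guard, restated as a self-contained sketch. The enum is a reduced stand-in for HBase's EventType and the method name is hypothetical.

```java
import java.util.EnumSet;

final class ExecutedStateSketch {
  // Reduced stand-in for the EventType values mentioned in this review.
  enum EventType {
    M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING, RS_ZK_REGION_OPENED,
    RS_ZK_REGION_CLOSING, RS_ZK_REGION_CLOSED
  }

  // States the region server has already acted on.
  static final EnumSet<EventType> ALREADY_EXECUTED = EnumSet.of(
      EventType.RS_ZK_REGION_CLOSING,
      EventType.RS_ZK_REGION_CLOSED,
      EventType.RS_ZK_REGION_OPENED);

  // Returns false when the timeout monitor must leave the znode alone.
  static boolean mayForceOffline(boolean timeOutMonitorReAllocate, EventType current) {
    if (timeOutMonitorReAllocate && current != null
        && ALREADY_EXECUTED.contains(current)) {
      return false;
    }
    return true;
  }
}
```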

In ZKAssign.java:
      boolean setData = false;
      try {
        setData = ZKUtil.setData(zkw, node, data.getBytes(), version);
        // Setdata throws KeeperException which aborts the Master. So we are
        // catching it here.
        // If just before setting the znode to OFFLINE if the RS has made any
        // change to the
        // znode state then we need to return -1.
      } catch (KeeperException kpe) {
        LOG.info("Version mismatch while setting the node to OFFLINE state.");
        return -1;

This change is to keep the master from aborting. If the forced OFFLINE is in
progress but the RS changes the state just before the data is set,
setData will fail, leading to a master abort.
Hence we catch the exception.
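
The idea of turning a version conflict into a -1 return value can be shown with a toy versioned node. Node and BadVersion below are hypothetical stand-ins for the znode and the KeeperException raised on a version mismatch; this sketch does not use the ZooKeeper client.

```java
final class VersionedSetSketch {
  // Stand-in for the exception raised on a version mismatch.
  static final class BadVersion extends Exception {}

  // Toy znode: a write succeeds only when the caller's expected version
  // matches the current one, and each successful write bumps the version.
  static final class Node {
    int version;
    String data = "OPENING";

    synchronized void setData(String newData, int expectedVersion) throws BadVersion {
      if (expectedVersion != version) {
        throw new BadVersion();
      }
      data = newData;
      version++;
    }
  }

  // Returns the new version, or -1 if someone else changed the node first.
  static int forceOffline(Node node, int expectedVersion) {
    try {
      node.setData("OFFLINE", expectedVersion);
      return node.version;
    } catch (BadVersion e) {
      // A concurrent change is an expected race, not a reason to abort.
      return -1;
    }
  }
}
```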

In AssignmentManager.java:
+      if (setOfflineInZK && versionOfOfflineNode == -1)
+        return;
This is just a refactoring of the existing code:
-     if (setOfflineInZK && !setOfflineInZooKeeper(state)) return;
So if setting the znode to OFFLINE is unsuccessful (the returned version is -1), return.

In ZKAssign.java:
// the below check ensure that double assignment doesnot happen.
    // When the node is created for the first time then the expected version
    // that is
    // passed will be -1 and the version in znode will be 0.
    // In all other cases the version in znode will be > 0.
    else if (beginState.equals(EventType.M_ZK_REGION_OFFLINE)
        && endState.equals(EventType.RS_ZK_REGION_OPENING)
        && expectedVersion == -1 && stat.getVersion() != 0) {
      LOG.warn(zkw.prefix("Attempt to transition the " + "unassigned node for "
          + encoded + " from " + beginState + " to " + endState + " failed, "
          + "the node existed but was version " + stat.getVersion()
          + " not the expected version " + expectedVersion));
      return -1;

As the comment explains, when the node is created for the first time the expected version
will be -1 but the actual version will be 0.  The scenario is this:
suppose RS1 has not yet transitioned the node from OFFLINE to OPENING.
If RS2 then gets the call from the Master after the node is forcefully changed to OFFLINE,
RS2 will take control of the node.
If RS1 starts transitioning the node at that point, we should not allow
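
The version rule for the OFFLINE to OPENING transition can be sketched as below. The method name is hypothetical, and the branch for expected versions other than -1 is an assumption added to make the example complete.

```java
final class FirstTransitionSketch {
  // expectedVersion == -1 means the caller believes the znode was just
  // created, in which case its real version must still be 0.
  static boolean allowOfflineToOpening(int expectedVersion, int znodeVersion) {
    if (expectedVersion == -1) {
      return znodeVersion == 0;
    }
    // Assumption: otherwise the caller must name the exact current version.
    return expectedVersion == znodeVersion;
  }
}
```

With this rule, a region server holding the stale expectation of -1 is rejected once the node has been forced to OFFLINE again and its version has moved past 0.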

This addresses bug HBASE-4015.

Diffs (updated)


Diff: https://reviews.apache.org/r/1668/diff


Yes, but I could not add a new test case.
TestMasterFailOver also passes with the current changes.



> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>                 Key: HBASE-4015
>                 URL: https://issues.apache.org/jira/browse/HBASE-4015
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.92.0
>         Attachments: HBASE-4015_1_trunk.patch, HBASE-4015_2_trunk.patch, HBASE-4015_reprepared_trunk_2.patch, Timeoutmonitor with state diagrams.pdf
>
> The current implementation of the TimeoutMonitor acts like a race condition generator, mostly making things worse rather than better. It does its own thing for a while without caring for what's happening in the rest of the master.
> The first thing that needs to happen is that the regions should not be processed in one big batch, because that sometimes can take minutes to process (meanwhile a region that timed out opening might have opened, then what happens is it will be reassigned by the TimeoutMonitor generating the never ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure how to do it in a scalable way in this case.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

