hbase-dev mailing list archives

From "Liu Shaohui (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-12534) Wrong region location cache in client after regions are moved
Date Wed, 19 Nov 2014 12:37:33 GMT
Liu Shaohui created HBASE-12534:

             Summary: Wrong region location cache in client after regions are moved
                 Key: HBASE-12534
                 URL: https://issues.apache.org/jira/browse/HBASE-12534
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.94.24
            Reporter: Liu Shaohui
            Assignee: Liu Shaohui
            Priority: Critical

In our 0.94 HBase cluster, we found that the client kept a wrong region location in its cache and did not
update it after the region was moved to another regionserver.
The cause is a wrong client config combined with a bug in RpcRetryingCaller of the HBase client.
The RPC configs are as follows:
But the client retry number is 3.
Assume that a region was on regionserver A and was then moved to regionserver B.
The client tries to make a call to regionserver A and gets a NotServingRegionException. Because
the retry number is not 1, the cached region location is not cleared. See: RpcRetryingCaller.java#141
and RegionServerCallable.java#127
  public void throwable(Throwable t, boolean retrying) {
    if (t instanceof SocketTimeoutException ||
    } else if (t instanceof NotServingRegionException && !retrying) {
      // Purge cache entries for this specific region from hbase:meta cache
      // since we don't call connect(true) when number of retries is 1.
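
To make the effect of the condition quoted above concrete, here is a minimal, self-contained sketch. It is not the HBase implementation: the class name, the one-entry cache, the region and server names, and the stand-in exception are all hypothetical, and it assumes the caller passes retrying = (retries != 1), as the description above implies.

```java
import java.util.HashMap;
import java.util.Map;

public class PurgeConditionSketch {
    /** Hypothetical stand-in for HBase's NotServingRegionException. */
    static class NotServingRegionException extends Exception {}

    /** Hypothetical location cache: region name -> regionserver name. */
    static final Map<String, String> cache = new HashMap<>();

    /** Mirrors only the quoted branch: purge the entry when the caller is not retrying. */
    static void throwable(String region, Throwable t, boolean retrying) {
        if (t instanceof NotServingRegionException && !retrying) {
            cache.remove(region);
        }
    }

    /** Assumption: retrying is derived as (retries != 1). Returns true if the entry survives. */
    static boolean staleAfterFailure(int retries) {
        cache.put("region-1", "regionserver-A");
        throwable("region-1", new NotServingRegionException(), retries != 1);
        return cache.containsKey("region-1");
    }

    public static void main(String[] args) {
        System.out.println(staleAfterFailure(1)); // false: entry purged
        System.out.println(staleAfterFailure(3)); // true: stale entry kept
    }
}
```

With a retry number of 3, the entry for regionserver A survives the first failure, so the cache is only corrected if a later attempt actually runs.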
But the call does not retry. Instead it throws a SocketTimeoutException, because the estimated
duration of the call is larger than the operation timeout. See RpcRetryingCaller.java#152
        expectedSleep = callable.sleep(pause, tries + 1);

        // If, after the planned sleep, there won't be enough time left, we stop now.
        long duration = singleCallDuration(expectedSleep);
        if (duration > callTimeout) {
          String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
              ": " + callable.getExceptionMessageAdditionalDetail();
          throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));

As a result, the wrong region location is never cleaned up.

In HBase 0.94, MIN_RPC_TIMEOUT in singleCallDuration is 2000 by default, which triggers
this bug.
  private long singleCallDuration(final long expectedSleep) {
    return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
      + MIN_RPC_TIMEOUT + expectedSleep;
But the same risk exists in the master code too.
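
The arithmetic behind the early abort can be checked with a small sketch. It reuses the formula from singleCallDuration but takes the elapsed time as a parameter instead of reading the clock. The class name and the timeout values of 1500 and 5000 are hypothetical and chosen only for illustration; they are not the values from our cluster.

```java
public class CallDurationSketch {
    /** 0.94 default cited above, in milliseconds. */
    static final long MIN_RPC_TIMEOUT = 2000;

    /** Same arithmetic as singleCallDuration, with elapsed time passed in explicitly. */
    static long singleCallDuration(long elapsedMs, long expectedSleepMs) {
        return elapsedMs + MIN_RPC_TIMEOUT + expectedSleepMs;
    }

    /** True when the caller would throw SocketTimeoutException instead of retrying. */
    static boolean abortsBeforeRetry(long elapsedMs, long expectedSleepMs, long callTimeoutMs) {
        return singleCallDuration(elapsedMs, expectedSleepMs) > callTimeoutMs;
    }

    public static void main(String[] args) {
        long callTimeout = 1500; // hypothetical operation timeout below MIN_RPC_TIMEOUT
        System.out.println(singleCallDuration(0, 0));             // 2000
        System.out.println(abortsBeforeRetry(0, 0, callTimeout)); // true
        System.out.println(abortsBeforeRetry(50, 100, 5000));     // false: 2150 <= 5000
    }
}
```

Any operation timeout below MIN_RPC_TIMEOUT makes the check fail even with zero elapsed time and zero sleep, so under that assumption the first failure is never followed by a retry.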

This message was sent by Atlassian JIRA
