hbase-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12534) Wrong region location cache in client after regions are moved
Date Fri, 21 Nov 2014 11:42:33 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220803#comment-14220803 ]

Hadoop QA commented on HBASE-12534:
-----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12682857/HBASE-12534-0.94-v1.diff
  against master branch at commit 325cdc0987f8176ac46695f5b0c93b0fc6605ab9.
  ATTACHMENT ID: 12682857

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11776//console

This message is automatically generated.

> Wrong region location cache in client after regions are moved
> -------------------------------------------------------------
>
>                 Key: HBASE-12534
>                 URL: https://issues.apache.org/jira/browse/HBASE-12534
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Liu Shaohui
>            Assignee: Liu Shaohui
>            Priority: Critical
>              Labels: client
>         Attachments: HBASE-12534-0.94-v1.diff, HBASE-12534-v1.diff
>
>
> In our 0.94 HBase cluster, we found that the client cached a wrong region location and did not update it after the region was moved to another regionserver.
> The cause is a wrong client config combined with a bug in RpcRetryingCaller of the HBase client.
> The RPC configs are as follows:
> {code}
> hbase.rpc.timeout=1000
> hbase.client.pause=200
> hbase.client.operation.timeout=1200
> {code}
> But the client retry number is 3:
> {code}
> hbase.client.retries.number=3
> {code}
> Assume that a region is on regionserver A and is then moved to regionserver B. The client tries to call regionserver A and gets a NotServingRegionException.
> Because the retry number is not 1, the region location cache is not cleaned. See RpcRetryingCaller.java#141 and RegionServerCallable.java#127:
> {code}
>   @Override
>   public void throwable(Throwable t, boolean retrying) {
>     if (t instanceof SocketTimeoutException ||
>       ....
>     } else if (t instanceof NotServingRegionException && !retrying) {
>       // Purge cache entries for this specific region from hbase:meta cache
>       // since we don't call connect(true) when number of retries is 1.
>       getConnection().deleteCachedRegionLocation(location);
>     }
>   }
> {code}
> However, the call is not retried. It throws a SocketTimeoutException because the time the call would take is larger than the operation timeout. See RpcRetryingCaller.java#152:
> {code}
>         expectedSleep = callable.sleep(pause, tries + 1);
>         // If, after the planned sleep, there won't be enough time left, we stop now.
>         long duration = singleCallDuration(expectedSleep);
>         if (duration > callTimeout) {
>           String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
>               ": " + callable.getExceptionMessageAdditionalDetail();
>           throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
>         }
> {code}
> As a result, the wrong region location is never cleaned up.
> [~lhofhansl]
> In HBase 0.94, MIN_RPC_TIMEOUT in singleCallDuration is 2000 by default, which triggers this bug.
> {code}
>   private long singleCallDuration(final long expectedSleep) {
>     return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
>       + MIN_RPC_TIMEOUT + expectedSleep;
>   }
> {code}
> But there is a risk in the master code too.
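
The interaction described in the issue can be sketched as a small standalone model. This is not HBase code: the class RetryBudgetSketch, its method names, and the map used as a stand-in for the location cache are all hypothetical. It also assumes that no time has elapsed when the first failure is handled and that the planned sleep equals hbase.client.pause (backoff ignored). With the reported settings the planned duration is 0 + 2000 + 200 = 2200 ms, which exceeds the 1200 ms operation timeout, so the caller gives up before any retry could refresh the cache.

```java
import java.util.HashMap;
import java.util.Map;

public class RetryBudgetSketch {
    // Default quoted in the report for 0.94; modeled here as a plain constant.
    public static final long MIN_RPC_TIMEOUT = 2000;

    // Mirrors the quoted singleCallDuration(): elapsed + MIN_RPC_TIMEOUT + planned sleep.
    public static long singleCallDuration(long elapsedMs, long expectedSleepMs) {
        return elapsedMs + MIN_RPC_TIMEOUT + expectedSleepMs;
    }

    // True when the budget check would throw SocketTimeoutException instead of retrying.
    public static boolean abortsBeforeRetry(long elapsedMs, long expectedSleepMs, long callTimeoutMs) {
        return singleCallDuration(elapsedMs, expectedSleepMs) > callTimeoutMs;
    }

    // Models one failed call against a moved region and reports whether the
    // stale cache entry is still present afterwards.
    public static boolean staleEntrySurvives(int retries, long pauseMs, long callTimeoutMs) {
        Map<String, String> cache = new HashMap<>();
        cache.put("region-1", "regionserver-A"); // stale: the region now lives on B

        // Per the report, the entry is purged on NotServingRegionException only
        // when the caller is not going to retry (retry number is 1).
        boolean retrying = retries != 1;
        if (!retrying) {
            cache.remove("region-1");
        }

        // Assumption: elapsed time is 0 and the planned sleep equals the pause.
        boolean aborted = abortsBeforeRetry(0, pauseMs, callTimeoutMs);
        if (!aborted) {
            // Assumption: an actual retry would relocate the region and replace the entry.
            cache.remove("region-1");
        }
        return cache.containsKey("region-1");
    }

    public static void main(String[] args) {
        // Settings from the report: retries=3, pause=200, operation timeout=1200.
        System.out.println("planned duration (ms): " + singleCallDuration(0, 200));
        System.out.println("stale entry survives: " + staleEntrySurvives(3, 200, 1200));
        // A larger operation timeout leaves room for the retry in this model.
        System.out.println("with timeout 5000: " + staleEntrySurvives(3, 200, 5000));
    }
}
```

In this model the stale entry survives only when retries are enabled and the budget check aborts the first retry, which is the combination reported above. Setting the retry number to 1 or raising the operation timeout above the planned duration avoids it.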



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
