hbase-issues mailing list archives

From "Liu Shaohui (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-12534) Wrong region location cache in client after regions are moved
Date Fri, 21 Nov 2014 11:20:35 GMT

     [ https://issues.apache.org/jira/browse/HBASE-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liu Shaohui updated HBASE-12534:
--------------------------------
    Attachment: HBASE-12534-v1.diff

Solution: To keep the code simple, delete the cached region location whenever the client gets
a NotServingRegionException from the regionserver, regardless of the retry logic.
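
The idea can be sketched as follows. This is only an illustration and not the contents of
HBASE-12534-v1.diff: the class PurgeOnNsreSketch, the map locationCache, the local
NotServingRegionException stand-in, and the methods throwableBefore and throwableAfter are
all hypothetical names that model the client's location cache with a plain map.

```java
import java.util.HashMap;
import java.util.Map;

public class PurgeOnNsreSketch {
    /** Local stand-in for HBase's NotServingRegionException (hypothetical, for illustration). */
    static class NotServingRegionException extends Exception {
        NotServingRegionException(String msg) { super(msg); }
    }

    /** Hypothetical region -> regionserver map standing in for the client's location cache. */
    static final Map<String, String> locationCache = new HashMap<>();

    /** Reported behavior: purge only when the call is NOT going to be retried. */
    static void throwableBefore(Throwable t, boolean retrying, String region) {
        if (t instanceof NotServingRegionException && !retrying) {
            locationCache.remove(region);
        }
    }

    /** Proposed idea: purge on every NotServingRegionException, ignoring the retry flag. */
    static void throwableAfter(Throwable t, boolean retrying, String region) {
        if (t instanceof NotServingRegionException) {
            locationCache.remove(region);
        }
    }

    public static void main(String[] args) {
        Throwable nsre = new NotServingRegionException("region moved");

        locationCache.put("region-1", "regionserver-A");
        throwableBefore(nsre, true, "region-1");
        // The stale entry survives because retrying == true.
        System.out.println("before-style handler keeps stale entry: "
            + locationCache.containsKey("region-1"));

        throwableAfter(nsre, true, "region-1");
        // The stale entry is dropped even though retrying == true.
        System.out.println("after-style handler keeps stale entry: "
            + locationCache.containsKey("region-1"));
    }
}
```

With the second handler, the next call has to look the region up again instead of reusing
the stale location, even if the retry loop is later cut short by the timeout check.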

> Wrong region location cache in client after regions are moved
> -------------------------------------------------------------
>
>                 Key: HBASE-12534
>                 URL: https://issues.apache.org/jira/browse/HBASE-12534
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.24
>            Reporter: Liu Shaohui
>            Assignee: Liu Shaohui
>            Priority: Critical
>              Labels: client
>         Attachments: HBASE-12534-v1.diff
>
>
> In our 0.94 HBase cluster, we found that the client cached a wrong region location and
> did not update it after the region was moved to another regionserver.
> The cause is a wrong client config combined with a bug in RpcRetryingCaller of the HBase client.
> The RPC configs are as follows:
> {code}
> hbase.rpc.timeout=1000
> hbase.client.pause=200
> hbase.client.operation.timeout=1200
> {code}
> But the client retry number is 3
> {code}
> hbase.client.retries.number=3
> {code}
> Assume that a region was on regionserver A and was then moved to regionserver B. The
> client tries to call regionserver A and gets a NotServingRegionException. Because the
> retry number is not 1, the cached region location is not cleaned. See
> RpcRetryingCaller.java#141 and RegionServerCallable.java#127:
> {code}
>   @Override
>   public void throwable(Throwable t, boolean retrying) {
>     if (t instanceof SocketTimeoutException ||
>       ....
>     } else if (t instanceof NotServingRegionException && !retrying) {
>       // Purge cache entries for this specific region from hbase:meta cache
>       // since we don't call connect(true) when number of retries is 1.
>       getConnection().deleteCachedRegionLocation(location);
>     }
>   }
> {code}
> But the call is not retried. Instead it throws a SocketTimeoutException, because the time
> the call would take is larger than the operation timeout. See RpcRetryingCaller.java#152:
> {code}
>         expectedSleep = callable.sleep(pause, tries + 1);
>         // If, after the planned sleep, there won't be enough time left, we stop now.
>         long duration = singleCallDuration(expectedSleep);
>         if (duration > callTimeout) {
>           String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
>               ": " + callable.getExceptionMessageAdditionalDetail();
>           throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
>         }
> {code}
> As a result, the wrong region location is never cleaned up.
> [~lhofhansl]
> In HBase 0.94, MIN_RPC_TIMEOUT in singleCallDuration is 2000 by default, which triggers
> this bug.
> {code}
>   private long singleCallDuration(final long expectedSleep) {
>     return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
>       + MIN_RPC_TIMEOUT + expectedSleep;
>   }
> {code}
> But the same risk exists in the master code too.
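
The arithmetic behind the report can be checked with a small sketch. It mirrors the
singleCallDuration formula quoted above, using the values from the description
(MIN_RPC_TIMEOUT of 2000 and hbase.client.operation.timeout=1200). The class
CallDurationSketch and the method wouldAbortBeforeRetry are hypothetical names, the elapsed
time is passed in as a parameter instead of being read from a clock, and the planned sleep
is taken as 0 as a lower bound (an assumption, since any real sleep only increases the
duration).

```java
public class CallDurationSketch {
    // Default quoted in the report for HBase 0.94, in milliseconds.
    static final long MIN_RPC_TIMEOUT_MS = 2000;

    /** Same shape as the quoted singleCallDuration, with elapsed time made explicit. */
    static long singleCallDuration(long elapsedMs, long expectedSleepMs) {
        return elapsedMs + MIN_RPC_TIMEOUT_MS + expectedSleepMs;
    }

    /** True when the retry loop would throw SocketTimeoutException instead of retrying. */
    static boolean wouldAbortBeforeRetry(long elapsedMs, long expectedSleepMs, long callTimeoutMs) {
        return singleCallDuration(elapsedMs, expectedSleepMs) > callTimeoutMs;
    }

    public static void main(String[] args) {
        long callTimeoutMs = 1200; // hbase.client.operation.timeout from the report
        long elapsedMs = 0;        // best case: the first call failed instantly
        long expectedSleepMs = 0;  // lower bound on the planned sleep (assumption)

        System.out.println(singleCallDuration(elapsedMs, expectedSleepMs));                   // 2000
        System.out.println(wouldAbortBeforeRetry(elapsedMs, expectedSleepMs, callTimeoutMs)); // true
    }
}
```

Even in the best case the estimate (2000) exceeds the operation timeout (1200), so with this
config the first NotServingRegionException is reported with retrying set to true, the cache
entry is kept, and the retry that would have refreshed the location never runs.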



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
