hbase-issues mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18955) HBase client queries stale hbase:meta location with half-dead RegionServer
Date Fri, 06 Oct 2017 21:21:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195261#comment-16195261 ]

Josh Elser commented on HBASE-18955:
------------------------------------

I was able to trivially work around this issue by clearing meta's cache in SSH (ServerShutdownHandler) after we fail
to read it. However, this isn't the right place to fix the issue. I can see the WebUI threads
(which load the user tables' regions) are still stuck trying to poll the old RS.
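For reference, the workaround boils down to something like this (sketch only, not the actual change; the helper class/method name and the exact call site inside SSH are mine):
{code}
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ClusterConnection;

final class MetaCacheWorkaround {  // hypothetical helper, not part of any patch
  /**
   * Called from SSH after a hbase:meta read against the old location fails:
   * drop the cached meta location so the next lookup re-resolves it instead
   * of retrying the half-dead RS, then rethrow so the caller still sees the
   * original failure.
   */
  static void clearStaleMetaLocation(ClusterConnection connection, IOException cause)
      throws IOException {
    connection.clearRegionCache(TableName.META_TABLE_NAME);
    throw cause;
  }
}
{code}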

I _think_ this is supposed to be handled within {{RegionServerCallable}}
{code}
  @Override
  public void throwable(Throwable t, boolean retrying) {
    if (t instanceof SocketTimeoutException ||
        t instanceof ConnectException ||
        t instanceof RetriesExhaustedException ||
        (location != null && getConnection().isDeadServer(location.getServerName()))) {
      // if thrown these exceptions, we clear all the cache entries that
      // map to that slow/dead server; otherwise, let cache miss and ask
      // hbase:meta again to find the new location
      if (this.location != null) getConnection().clearCaches(location.getServerName());
    } else if (t instanceof RegionMovedException) {
      getConnection().updateCachedLocations(tableName, row, t, location);
    } else if (t instanceof NotServingRegionException && !retrying) {
      // Purge cache entries for this specific region from hbase:meta cache
      // since we don't call connect(true) when number of retries is 1.
      getConnection().deleteCachedRegionLocation(location);
    }
  }
{code}

However, the exception we actually see is as follows:

{noformat}
java.io.IOException: Call to hw10447.local/10.200.31.19:16201 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=42, waitTime=60003, operationTimeout=60000 expired.
{noformat}

Maybe this logic just needs to be expanded?
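One way to expand it (sketch only, untested; the one-level cause unwrap and the {{CallTimeoutException}} check are my additions to the excerpt above) would be:
{code}
  // Sketch only: possible expansion of the check above. Unwrap one level of
  // cause so a CallTimeoutException wrapped in an IOException (as in the
  // stack trace above) also clears the cached locations for that server.
  @Override
  public void throwable(Throwable t, boolean retrying) {
    Throwable cause = t.getCause() != null ? t.getCause() : t;
    if (t instanceof SocketTimeoutException ||
        t instanceof ConnectException ||
        t instanceof RetriesExhaustedException ||
        cause instanceof CallTimeoutException ||   // new condition
        (location != null && getConnection().isDeadServer(location.getServerName()))) {
      // Clear all cache entries that map to that slow/dead server; the next
      // attempt misses the cache and goes back to hbase:meta (or ZooKeeper,
      // for meta itself) to find the new location.
      if (this.location != null) getConnection().clearCaches(location.getServerName());
    } else if (t instanceof RegionMovedException) {
      getConnection().updateCachedLocations(tableName, row, t, location);
    } else if (t instanceof NotServingRegionException && !retrying) {
      // Purge cache entries for this specific region since we don't call
      // connect(true) when number of retries is 1.
      getConnection().deleteCachedRegionLocation(location);
    }
  }
{code}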

FYI [~enis], [~tedyu]

> HBase client queries stale hbase:meta location with half-dead RegionServer
> --------------------------------------------------------------------------
>
>                 Key: HBASE-18955
>                 URL: https://issues.apache.org/jira/browse/HBASE-18955
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.1.12
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.1.13
>
>
> Have been investigating a case with [~tedyu] where, when a RegionServer becomes "hung"
(for no specific reason -- not the point), the client becomes stuck trying to talk to this
RegionServer, never exiting. This was eventually tracked down to HBASE-15645. However, in
testing the fix, I found that there is an additional problem which only affects branch-1.1.
> When the RegionServer in the "half-dead" state is also hosting meta, the hbase client
(both the one trying to read data and the one in the Master trying to read meta in
SSH) gets stuck repeatedly trying to read meta from the old location after meta has been reassigned.
> The general test outline goes like this:
> * Start at least 2 regionservers
> * Load some data into a table ({{hbase pe}} is great)
> * Find a region that is hosted by the same RS that is hosting meta
> * {{kill -SIGSTOP}} that RS hosting the user region and meta
> * Issue a {{get}} in the hbase-shell trying to read from that user region
> The expectation is that the ZK session will expire for the STOP'ed RS, meta will be reassigned,
then the user regions will be reassigned, then the client will get the result of the get without
seeing an error (as long as all of this happens within the hbase.client.operation.timeout value,
of course).
> We see this work as expected on HBase 1.2.4 and 1.3.2-SNAPSHOT, but on 1.1.13-SNAPSHOT the
Master gets as far as re-assigning meta, then gets stuck trying to read meta from the STOP'ed
RS instead of from where it re-assigned it. Because of this, the regions stay in transition until
the master is restarted or the STOP'ed RS is CONT'ed. My best guess is that when the RS sees
the {{SIGCONT}}, it immediately begins stopping, which is enough to kick the client into refreshing
the region location cache.



