hbase-issues mailing list archives

From "Esteban Gutierrez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10871) Indefinite OPEN/CLOSE wait on busy RegionServers
Date Wed, 04 Jun 2014 04:38:02 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017368#comment-14017368 ]

Esteban Gutierrez commented on HBASE-10871:

[~jxiang] I ran into the same issue recently. Can we just let the master retry the assignment
in case of a {{java.net.SocketTimeoutException}}, e.g. by just removing the {{return}}?

 if (t instanceof java.net.SocketTimeoutException
     && this.serverManager.isServerOnline(plan.getDestination())) {
   LOG.warn("Call openRegion() to " + plan.getDestination()
       + " has timed out when trying to assign "
       + region.getRegionNameAsString()
       + ", but the region might already be opened on "
       + plan.getDestination() + ".", t);
   // return; <=== removing this would let the master retry the assignment
 }
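
To illustrate the idea, here is a minimal, self-contained sketch of a bounded retry loop
around an openRegion-style RPC. The {{Rpc}} interface, {{assignWithRetry}} helper, and the
attempt limit are all hypothetical names for illustration, not HBase APIs; the point is only
that the caller retries on {{SocketTimeoutException}} instead of returning on the first one,
and gives up after a fixed number of attempts rather than leaving the region pending forever.

```java
import java.net.SocketTimeoutException;

public class AssignRetrySketch {
    // Hypothetical stand-in for the openRegion RPC to a RegionServer.
    interface Rpc { void call(int attempt) throws SocketTimeoutException; }

    /** Returns the attempt number that succeeded, or -1 if every attempt timed out. */
    static int assignWithRetry(Rpc openRegion, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                openRegion.call(attempt);
                return attempt;   // assignment RPC got through to the RS
            } catch (SocketTimeoutException e) {
                // Busy RS filled its RPC queue: log and retry instead of
                // returning here, which is what leaves PENDING_OPEN stuck.
            }
        }
        return -1;                // bounded give-up instead of an indefinite wait
    }

    public static void main(String[] args) {
        // Simulated RS that times out twice, then accepts the open request.
        int attempts = assignWithRetry(a -> {
            if (a < 3) throw new SocketTimeoutException("RPC queue full");
        }, 5);
        System.out.println(attempts); // prints 3
    }
}
```

The bounded loop also addresses the "give up permanently" concern from the issue description:
after {{maxAttempts}} timeouts the caller can abandon the movement plan rather than spin.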

> Indefinite OPEN/CLOSE wait on busy RegionServers
> ------------------------------------------------
>                 Key: HBASE-10871
>                 URL: https://issues.apache.org/jira/browse/HBASE-10871
>             Project: HBase
>          Issue Type: Improvement
>          Components: Balancer, master, Region Assignment
>    Affects Versions: 0.94.6
>            Reporter: Harsh J
> We observed a case where, when a specific RS was bombarded by a large volume of regular
requests that spiked and filled up its RPC queue, the balancer-invoked unassigns and assigns
for regions on this server entered an indefinite retry loop.
> The regions specifically began waiting in PENDING_CLOSE/PENDING_OPEN states indefinitely
because the HBase client RPC from the ServerManager at the master kept running into SocketTimeouts.
This made the affected regions unavailable on that server. The timeout monitor
retry default of 30m in 0.94's AM widened the waiting gap a bit further (this is now
10m in 0.95+'s new AM, which also makes further retries before we get there, which is good).
> Wonder if there's a way to improve this situation generally. PENDING_OPENs may be easy
to handle - we can switch them out and move them elsewhere. PENDING_CLOSEs may be a bit
trickier, but perhaps there should at least be a way to "give up" permanently on a movement plan
and let things be for a while, hoping the RS recovers on its own (such that
clients also have a chance of getting things to work in the meantime)?

This message was sent by Atlassian JIRA
