hbase-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
Date Thu, 19 Jul 2012 16:08:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418397#comment-13418397 ]

nkeywal commented on HBASE-6364:
--------------------------------

Here's a possible cause.

In HBaseClient#getConnection():
{noformat}
    do {
      synchronized (connections) {
        connection = connections.get(remoteId);
        if (connection == null) {
          connection = new Connection(remoteId);
          connections.put(remoteId, connection);
        }
      }
    } while (!connection.addCall(call));
    
    connection.setupIOstreams();
    return connection;
  }
{noformat}
Connection#addCall and Connection#setupIOstreams are synchronized. If #setupIOstreams fails,
it marks the connection as dead, removes it from the connections list, and throws an exception.
#addCall returns false if the connection is marked as dead. So:

case 1 -> sometimes we add the call to a connection that is about to be marked as dead (see
the sketch just after this scenario):
 Thread 1: creates the connection, adds it to the connections list, calls addCall
 Thread 2: gets the connection from the list, adds its call to the calls list
 Thread 1: enters setupIOstreams, fails, marks the connection as dead, throws an exception, done
 Thread 2: enters setupIOstreams, sees that the connection is dead, done. The call has been
added to a dead connection.
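
A minimal standalone sketch of that interleaving, assuming nothing beyond the synchronization
pattern above (this is not the real HBaseClient code, just a stripped-down stand-in replayed
deterministically on a single thread):
{noformat}
import java.util.ArrayList;
import java.util.List;

// Stand-in for HBaseClient's Connection, reduced to the two synchronized
// methods involved in case 1. Not the real HBase code: the interleaving of
// the two threads is replayed step by step for clarity.
public class Case1Race {

  static class Connection {
    private boolean dead = false;
    private final List<String> calls = new ArrayList<String>();

    synchronized boolean addCall(String call) {
      if (dead) {
        return false;          // caller must reloop and pick a fresh connection
      }
      calls.add(call);         // the call is now parked on this connection
      return true;
    }

    synchronized void setupIOstreams() {
      if (dead) {
        return;                // someone else already failed: just give up
      }
      // Simulate NetUtils.connect() failing against a powered-off host.
      dead = true;                                   // mark the connection dead...
      throw new RuntimeException("connect failed");  // ...and propagate the error
    }

    synchronized int pendingCalls() {
      return calls.size();
    }
  }

  public static void main(String[] args) {
    Connection c = new Connection();

    c.addCall("call A");       // Thread 1: creates the connection, registers its call
    c.addCall("call B");       // Thread 2: finds the same connection, registers its call

    try {
      c.setupIOstreams();      // Thread 1: connect fails, connection marked dead
    } catch (RuntimeException expected) {
      // Thread 1 is done; in the real code the connection is also removed from the map.
    }

    c.setupIOstreams();        // Thread 2: sees a dead connection and returns, done

    // Both calls now sit on a connection nobody will ever use again.
    System.out.println("calls stranded on a dead connection: " + c.pendingCalls());
  }
}
{noformat}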

case 2 -> if we have a lot of threads on a dying connection (rough numbers are sketched just
after this scenario), we will have:
 Thread 1: goes as far as setupIOstreams
 All other threads: get the connection from the list, wait on the synchronized addCall
 Thread 1: exits setupIOstreams with an exception after 20 seconds (socket timeout)
 All other threads: call addCall; as the connection is dead, they reloop
 One of these threads creates a new connection
 One of these threads wins the race on addCall
 One of them wins the race on setupIOstreams
 Most of them are still waiting on addCall, so they reloop
 So we are back to case 1 or case 2.
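
A rough back-of-the-envelope on why this serialization hurts so much. The thread count below
is an assumption on my side (the report only says the client spikes to maxThreads), but if each
waiting thread has to pay the 20s connect timeout by itself inside the synchronized
setupIOstreams, recovery grows linearly with the number of threads, which lands in the same
ballpark as the 35 minutes reported:
{noformat}
// Rough arithmetic only; 100 threads is an assumed maxThreads value.
public class SerialTimeoutEstimate {
  public static void main(String[] args) {
    int waitingThreads = 100;     // assumed client-side thread pool size
    int connectTimeoutSec = 20;   // default ipc.socket.timeout
    int totalSec = waitingThreads * connectTimeoutSec;
    // Each thread serially burns one full connect timeout before the next one tries.
    System.out.println("worst-case recovery ~ " + (totalSec / 60) + " minutes");  // ~33 minutes
  }
}
{noformat}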


We would have the same behavior with the pure Hadoop client (ipc.Client), as the implementation
is similar, at least in 1.0.3.


Suraj, does this match your analysis? How many region servers and regions did you have during
your test? What's the client doing?
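
For reference, the client-side workaround described in the issue (lowering ipc.socket.timeout)
can also be set programmatically instead of in hbase-site.xml; a minimal sketch, assuming the
usual HBaseConfiguration entry point:
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FastFailingClientConf {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Default is 20000 ms; drop it so a connect to a dead RS fails fast.
    // 100 ms is the value used in the report below; pick what fits your network.
    conf.setInt("ipc.socket.timeout", 100);
    // ... build HTable / HConnection instances from this conf as usual ...
  }
}
{noformat}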
                
> Powering down the server host holding the .META. table causes HBase Client to take excessively
long to recover and connect to reassigned .META. table
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6364
>                 URL: https://issues.apache.org/jira/browse/HBASE-6364
>             Project: HBase
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.90.6, 0.92.1, 0.94.0
>            Reporter: Suraj Varma
>            Assignee: nkeywal
>              Labels: client
>
> When a server host with a Region Server holding the .META. table is powered down on a
live cluster, while the HBase cluster itself detects and reassigns the .META. table, connected
HBase Clients take an excessively long time to detect this and re-discover the reassigned
.META.
> Workaround: Decrease ipc.socket.timeout on the HBase Client side to a low value (the default
of 20s leads to a 35-minute recovery time; we were able to get acceptable results with 100ms,
giving a 3-minute recovery)
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. table (i.e. power off ... and keep
it off)
> 3) Measure how long it takes for the cluster to reassign the .META. table and for client threads
to re-lookup and re-orient to the smaller cluster (minus the RS and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to recover (i.e.
for the thread count to go back to normal) - no client calls are serviced - they just back
up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the oahh.ipc.HBaseClient#setupIOStreams
method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this synchronized method
was blocked on  NetUtils.connect(this.socket, remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to the dead RS
(till socket times out after 20s), retries, and then the next thread gets in and so forth
in a serial manner.
> Workaround:
> -------------------
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number (1000 ms,
100 ms, etc) in the client-side hbase-site.xml. With this setting, the client threads recovered
in a couple of minutes by failing fast and re-discovering the .META. table on a reassigned
RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial "HConnection"
setup via NetUtils.connect and should only ever be used when connectivity to a region
server is lost and needs to be re-established, i.e. it does not affect normal "RPC" activity,
as this is just the connect timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will require
.META. table re-lookups.
> This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
