hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-13851) RpcClientImpl.close() can hang with cancelled replica RPCs
Date Sat, 06 Jun 2015 02:25:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575524#comment-14575524
] 

Enis Soztutar edited comment on HBASE-13851 at 6/6/15 2:24 AM:
---------------------------------------------------------------

Here is an explanation of what is happening for the brave souls: 

RpcClientImpl hangs in close() after interrupting every Connection thread that is
running.
For each region server (RS) we have a Connection thread and a CallSender thread. CallSender
is only started if we are using specificThreadForWriting (enabled for replica reads). The
CallSender thread is started in the Connection constructor, while the Connection thread itself
is started only after setupIOStreams() succeeds. setupIOStreams() is called only when a call
is being written.

RpcClientImpl keeps a map of Connection objects. A new RPC call creates a new Connection
object and adds it to the map if needed. RpcClient.close() interrupts all Connections and
waits until every Connection has been removed from the map. Normally, the Connection thread,
on being interrupted, calls markClosed(); the thread's run loop then ends and, as its last
operation, calls close(). Connection.close() removes the Connection from the RpcClient's
connections map.
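
To make the normal shutdown path above concrete, here is a minimal, self-contained sketch.
It is not the HBase code: NormalCloseSketch, Conn, closeClient() and runDemo() are
hypothetical names, and a sleeping thread stands in for the Connection thread waiting on
responses. It only illustrates the "interrupt everything, then poll until the map is empty"
pattern described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: these names are hypothetical, not HBase classes.
final class NormalCloseSketch {

    /** Stand-in for a Connection: a thread that removes itself from the map on exit. */
    static final class Conn extends Thread {
        private final String key;
        private final Map<String, Conn> owner;

        Conn(String key, Map<String, Conn> owner) {
            this.key = key;
            this.owner = owner;
            setDaemon(true);
        }

        @Override
        public void run() {
            try {
                Thread.sleep(Long.MAX_VALUE); // stands in for blocking on server responses
            } catch (InterruptedException e) {
                // Interrupted by closeClient(): roughly where markClosed() would happen.
            } finally {
                owner.remove(key); // roughly what Connection.close() does
            }
        }
    }

    /** Interrupt every connection, then poll until the map is empty. */
    static void closeClient(Map<String, Conn> connections) {
        for (Conn c : connections.values()) {
            c.interrupt();
        }
        while (!connections.isEmpty()) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Registers and starts n connections, closes the client, returns how many remain. */
    static int runDemo(int n) {
        Map<String, Conn> connections = new ConcurrentHashMap<>();
        for (int i = 0; i < n; i++) {
            Conn c = new Conn("rs-" + i, connections);
            connections.put("rs-" + i, c);
            c.start();
        }
        closeClient(connections);
        return connections.size();
    }

    public static void main(String[] args) {
        System.out.println("connections left after close: " + runDemo(3));
    }
}
```

Note that closeClient() only terminates because every registered Conn has a running thread
that will eventually remove its own map entry.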

If a replica RPC is performed, a Connection object is constructed and added to the map.
Normally the RPC call is handled by the CallSender thread, which is already running. It calls
setupIOStreams() and, depending on whether an exception is thrown, either starts the Connection
thread or calls Connection.close(), which removes the Connection from the map.

In a rare case, a new Connection is created, but before CallSender sends the RPC call (and,
to do so, sets up the IO streams and starts the Connection thread), the RPC is cancelled
because another replica responded first. Previously we were not cancelling the RPC, but since
HBASE-12668 the cancellation does happen, so the Connection thread never starts if no more
RPCs arrive for that Connection. In this case RpcClientImpl.close() has no running Connection
thread to interrupt, so Connection.close() is never called, the Connection is never removed
from the map, and close() waits forever.
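
The failure mode can be reproduced in miniature. The sketch below is a simplified model, not
the HBase code: CancelledReplicaCloseSketch, Conn and closeDrains() are hypothetical names.
It registers a connection in the map and optionally skips starting its thread, which models
the call being cancelled before the IO streams are set up. The polling loop is given a
deadline purely so the demonstration terminates; as described above, the real close() waits
until the map is empty.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: these names are hypothetical, not HBase classes.
final class CancelledReplicaCloseSketch {

    /** Stand-in for a Connection: removes itself from the map when its thread exits. */
    static final class Conn extends Thread {
        private final String key;
        private final Map<String, Conn> owner;

        Conn(String key, Map<String, Conn> owner) {
            this.key = key;
            this.owner = owner;
            setDaemon(true);
        }

        @Override
        public void run() {
            try {
                Thread.sleep(Long.MAX_VALUE); // stands in for blocking on server responses
            } catch (InterruptedException e) {
                // Interrupted by the closing thread.
            } finally {
                owner.remove(key); // cleanup runs only if the thread was ever started
            }
        }
    }

    /**
     * Registers one connection, optionally starts its thread, then interrupts it and
     * polls the map. Returns true if the map emptied before the deadline.
     */
    static boolean closeDrains(boolean readerStarted, long timeoutMs) {
        Map<String, Conn> connections = new ConcurrentHashMap<>();
        Conn c = new Conn("replica-rs", connections);
        connections.put("replica-rs", c);
        if (readerStarted) {
            c.start(); // normal case: IO streams were set up, thread is running
        }
        // Cancelled case: the thread never runs, so its cleanup never executes.
        c.interrupt();

        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!connections.isEmpty() && System.nanoTime() - deadline < 0) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return connections.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("thread started, map drained: " + closeDrains(true, 2000));
        System.out.println("thread never started, map drained: " + closeDrains(false, 300));
    }
}
```

In the second case the entry stays in the map however long the caller polls, which matches
the TestClient-3 thread sleeping inside RpcClientImpl.close() in the jstack quoted below.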


> RpcClientImpl.close() can hang with cancelled replica RPCs
> ----------------------------------------------------------
>
>                 Key: HBASE-13851
>                 URL: https://issues.apache.org/jira/browse/HBASE-13851
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 2.0.0, 1.2.0, 1.1.1
>
>
> We have seen clients hang while running the test {{IntegrationTestRegionReplicaPerf}}
> on the 1.1 code base. The jstack gives:
> {code}
> "IPC Client (1344340481) connection to os-enis-dal-test-jun-4-1.openstacklocal/172.22.80.25:16020 from root - writer" daemon prio=10 tid=0x00007f3891b29800 nid=0x7345 waiting on condition [0x00007f3865647000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x000000070d54a240> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>         at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:374)
>         at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$CallSender.run(RpcClientImpl.java:253)
> "TestClient-3" prio=10 tid=0x00007f3892660800 nid=0x63b0 waiting on condition [0x00007f386ecdd000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.ipc.RpcClientImpl.close(RpcClientImpl.java:1139)
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.internalClose(ConnectionManager.java:2371)
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.close(ConnectionManager.java:2384)
>         at org.apache.hadoop.hbase.PerformanceEvaluation$Test.testTakedown(PerformanceEvaluation.java:1036)
>         at org.apache.hadoop.hbase.PerformanceEvaluation$RandomReadTest.testTakedown(PerformanceEvaluation.java:1351)
>         at org.apache.hadoop.hbase.PerformanceEvaluation$Test.test(PerformanceEvaluation.java:1055)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.runOneClient(PerformanceEvaluation.java:1612)
>         at org.apache.hadoop.hbase.PerformanceEvaluation$1.call(PerformanceEvaluation.java:410)
>         at org.apache.hadoop.hbase.PerformanceEvaluation$1.call(PerformanceEvaluation.java:405)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
