hbase-issues mailing list archives

From "huaxiang sun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17889) ResultBoundedCompletionService's cancel() needs to interrupt the working thread and free it to the thread-pool
Date Fri, 07 Apr 2017 18:06:41 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961203#comment-15961203 ]

huaxiang sun commented on HBASE-17889:

Thanks @stack and [~tedyu]. getTaskFuture() is not used anywhere, so I will clean up the code a bit.
getFuture/setFuture will be called from different threads (I think that, at least when the thread pool
is shut down, cancel() will be called from a different thread), so making the future field volatile seems necessary.
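
As a minimal sketch of the visibility concern (not the actual HBase code): the class name FutureHolder and its cancel() helper are hypothetical, and only the getFuture/setFuture names come from the comment above. The field is written by the submitting thread and may be read by a different thread that cancels the call.

```java
import java.util.concurrent.Future;

// Hypothetical holder used only for illustration; not the HBase class.
class FutureHolder<T> {
    // volatile: a write in setFuture() on one thread is visible to a
    // later getFuture()/cancel() on another thread.
    private volatile Future<T> future;

    void setFuture(Future<T> f) {
        this.future = f;
    }

    Future<T> getFuture() {
        return future;
    }

    // May run on a thread other than the one that called setFuture().
    boolean cancel() {
        Future<T> f = future; // read the volatile field once
        return f != null && f.cancel(true);
    }
}
```

Without volatile (or some other synchronization), the cancelling thread is not guaranteed to see the future that the submitting thread stored.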

The test was done on 1.2 code. A test client issues continuous GETs with
consistency TIMELINE against a table that has 2 replicas. When the region server hosting the primary
replica is shut down with "shutdown -r now", the test client is stuck for about 50 seconds;
the jstack dump is attached. I added trace logging to the code to print the QueueingFuture
references submitted and returned. Before the client got stuck, the QueueingFutures for the
replica returned, but the ones for the primary replica did not. After about 50 seconds (when the
socket write times out), the QueueingFutures for the primary replica returned. This confirms that
the stuck threads are the ones for the primary replica.

With this fix, the same test was performed and the test client did not hang any more. The
trace log showed that the threads for the primary replica were interrupted and completed after
cancel() was called.
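
The effect described above can be sketched with plain java.util.concurrent classes. This is an illustration under assumptions, not the HBase patch: the class CancelInterruptSketch and the method workerFreedAfterCancel are hypothetical, a sleeping task stands in for the call stuck on the primary replica, and a standard Future stands in for QueueingFuture.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; names are made up for illustration.
class CancelInterruptSketch {
    // Returns true if a second task could run within one second after
    // the first (blocked) task was cancelled.
    static boolean workerFreedAfterCancel(boolean mayInterrupt) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch started = new CountDownLatch(1);
        try {
            // Stand-in for a stuck call: blocks until interrupted.
            Future<?> stuck = pool.submit(() -> {
                started.countDown();
                try {
                    Thread.sleep(60_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await(); // the only worker thread is now occupied
            stuck.cancel(mayInterrupt);

            // The next task can only run once the worker is free again.
            Future<String> next = pool.submit(() -> "done");
            try {
                return "done".equals(next.get(1, TimeUnit.SECONDS));
            } catch (TimeoutException e) {
                return false;
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In this sketch, cancelling with mayInterrupt set to true interrupts the blocked worker, so the next task runs promptly. Cancelling with false marks the future as cancelled but leaves the worker occupied, which mirrors the hang described in the issue.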

The master branch code has changed a bit, as the lock is not there anymore. I think the fix still
applies to the master branch. I will try to run the test against the master branch.

> ResultBoundedCompletionService's cancel() needs to interrupt the working thread and free it to the thread-pool
> --------------------------------------------------------------------------------------------------------------
>                 Key: HBASE-17889
>                 URL: https://issues.apache.org/jira/browse/HBASE-17889
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 2.0.0, 1.4.0, 1.2.6, 1.3.2
>            Reporter: huaxiang sun
>            Assignee: huaxiang sun
>         Attachments: HBASE-17889-master-001.patch, jstack.txt
> We ran into a case with read replicas: when the server hosting the primary region is shut down, Get did not go to the replica region and paused for about 50 seconds before it resumed.
> More debugging found that when the server is down, one of the threads was stuck at the write while holding the lock at
> https://github.com/apache/hbase/blob/branch-1.3/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClientImpl.java#L916.
> The later write threads waited on this lock until all threads in the connection's thread pool were stuck on it. At that point, no work can be done. After the socket write times out, all threads are freed and work continues.
> When QueueingFuture#cancel() is called, it does not interrupt the working thread and return the thread to the pool.
> Attaching the jstack trace.
