accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3975) Deadlock by recursive scans
Date Thu, 27 Aug 2015 14:53:45 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716791#comment-14716791 ]

Christopher Tubbs commented on ACCUMULO-3975:
---------------------------------------------

I think of it as an improvement, more than a bug... but it'd be nice if we controlled the
behavior a bit better, so that users can't shoot themselves in the foot by causing a deadlock.

> Deadlock by recursive scans
> ---------------------------
>
>                 Key: ACCUMULO-3975
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3975
>             Project: Accumulo
>          Issue Type: Bug
>          Components: mini, tserver
>    Affects Versions: 1.7.0
>            Reporter: Dylan Hutchison
>             Fix For: 1.8.0
>
>
> A tablet server has a fixed-size thread pool that it uses for scanning.  The maximum number of threads is controlled by {{tserver.readahead.concurrent.max}}, which defaults to 16.
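> A minimal, hedged sketch of raising that limit through the client API (the helper name and the value 32 are illustrative; the same change can be made in the shell with {{config -s tserver.readahead.concurrent.max=32}}):
> {code:java}
> import org.apache.accumulo.core.client.AccumuloException;
> import org.apache.accumulo.core.client.AccumuloSecurityException;
> import org.apache.accumulo.core.client.Connector;
>
> public class RaiseReadaheadLimit {
>   // Hypothetical helper: bump the per-tserver readahead thread limit above
>   // the default of 16. Assumes the caller already holds a Connector with
>   // permission to change system properties.
>   static void raise(Connector conn) throws AccumuloException, AccumuloSecurityException {
>     conn.instanceOperations().setProperty("tserver.readahead.concurrent.max", "32");
>   }
> }
> {code}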
> Take the use case of opening a Scanner inside of a server-side iterator.  The following results in deadlock (a sketch of the pattern follows the list).
> 1. A client creates a BatchScanner (call this A) with enough query threads (say, 16) that it uses up all the readahead threads on a single tablet server.
> 2. Inside the scan on that unlucky tablet server, an iterator opens a Scanner (call these B) to tablets on the same tablet server.
> 3. The Scanner Bs inside the iterators block because there is no free readahead thread on the target tablet server to serve the request. They never unblock.  Essentially the tserver scan threads block on trying to obtain tserver scan threads from the same thread pool.
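> A minimal sketch of the pattern in step 2 (the class name, option names, and connection handling are hypothetical, not the actual Graphulo code): a server-side iterator that opens a Scanner back into the same instance, so it consumes a second readahead thread while already occupying one.
> {code:java}
> import java.io.IOException;
> import java.util.Map;
> import java.util.Map.Entry;
>
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.client.Scanner;
> import org.apache.accumulo.core.client.ZooKeeperInstance;
> import org.apache.accumulo.core.client.security.tokens.PasswordToken;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.iterators.IteratorEnvironment;
> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
> import org.apache.accumulo.core.iterators.WrappingIterator;
> import org.apache.accumulo.core.security.Authorizations;
>
> public class RemoteReadIterator extends WrappingIterator {
>   private Map<String,String> opts;
>
>   @Override
>   public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options,
>       IteratorEnvironment env) throws IOException {
>     super.init(source, options, env);
>     opts = options;  // expects instanceName, zookeepers, user, password, otherTable
>   }
>
>   // In a real iterator this would be called from seek()/next(); simplified here.
>   void readOtherTable() throws Exception {
>     Connector conn = new ZooKeeperInstance(opts.get("instanceName"), opts.get("zookeepers"))
>         .getConnector(opts.get("user"), new PasswordToken(opts.get("password")));
>     // This is one of the Scanner Bs: it is served by the same tserver's readahead
>     // pool, so if BatchScanner A already holds all 16 threads it blocks forever,
>     // never releasing the thread the outer scan is sitting on.
>     Scanner scanner = conn.createScanner(opts.get("otherTable"), Authorizations.EMPTY);
>     for (Entry<Key,Value> e : scanner) {
>       // combine e with data from the wrapped source ...
>     }
>   }
> }
> {code}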
> The tablet server does not seem to recover from this event even after the client disconnects (e.g. by killing the client).  Not all of the internalRead threads appear to die with an IOException, which can prevent subsequent scans with smaller numbers of tablets from succeeding.  It does recover when the tablet server is restarted.
> The tablet server has a mechanism to increase the thread pool size, in {{rpc.TServerUtils.createSelfResizingThreadPool}}, but it seems to be ineffective here.  I see log messages like these (a generic sketch of such a mechanism follows the log excerpt):
> {noformat}
> 2015-08-26 21:35:24,247 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:25,248 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:26,250 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:27,252 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> {noformat}
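> For reference, a generic sketch of that kind of self-resizing pool (this is not the actual {{TServerUtils}} code, only the general shape the messages above suggest): a periodic task that grows a {{ThreadPoolExecutor}} whenever all of its threads are busy.
> {code:java}
> import java.util.concurrent.Executors;
> import java.util.concurrent.LinkedBlockingQueue;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.ThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
>
> public class SelfResizingPoolSketch {
>   public static ThreadPoolExecutor create(int initialSize) {
>     final ThreadPoolExecutor pool = new ThreadPoolExecutor(initialSize, initialSize,
>         60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
>     ScheduledExecutorService watcher = Executors.newSingleThreadScheduledExecutor();
>     watcher.scheduleWithFixedDelay(new Runnable() {
>       @Override
>       public void run() {
>         // When every thread is busy, grow the pool by half and log the new size.
>         if (pool.getActiveCount() >= pool.getCorePoolSize()) {
>           int newSize = pool.getCorePoolSize() + pool.getCorePoolSize() / 2;
>           pool.setMaximumPoolSize(newSize);  // raise the ceiling first
>           pool.setCorePoolSize(newSize);     // then the core size
>           System.out.println("Increasing server thread pool size to " + newSize);
>         }
>       }
>     }, 1, 1, TimeUnit.SECONDS);
>     return pool;
>   }
> }
> {code}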
> Also, a bunch of these pop up, in case it helps:
> {noformat}
> 2015-08-26 21:38:29,417 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:34,428 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1]
> 2015-08-26 21:38:39,433 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:44,266 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:38802 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 2015-08-26 21:38:44,438 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:48,022 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0 entries in 0.02 secs (lookup_time:0.02 secs tablets:1 ranges:1)
> 2015-08-26 21:38:48,034 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0 entries in 0.01 secs (lookup_time:0.01 secs tablets:1 ranges:1)
> 2015-08-26 21:38:49,452 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:54,456 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:59,473 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:39:04,484 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> {noformat}
> I pushed a [test case that reproduces the deadlock in the Graphulo test code|https://github.com/Accla/graphulo/blob/master/src/test/java/edu/mit/ll/graphulo/AccumuloBugTest.java#L47].  It shows that when we use fewer threads than {{tserver.readahead.concurrent.max}} (16), everything is okay, but if we use more threads, the deadlock occurs quite reliably.
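> The client side of that reproduction boils down to something like the following sketch (the table name, iterator priority, and {{RemoteReadIterator}} are the hypothetical ones from the earlier sketch, not the actual Graphulo test):
> {code:java}
> import java.util.Collections;
> import java.util.Map.Entry;
>
> import org.apache.accumulo.core.client.BatchScanner;
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.client.IteratorSetting;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Range;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.security.Authorizations;
>
> public class RecursiveScanRepro {
>   // Assumes RemoteReadIterator is on the tserver classpath and its connection
>   // options are configured elsewhere.
>   static void runBlockedScan(Connector conn) throws Exception {
>     // 16 query threads == the default tserver.readahead.concurrent.max.
>     BatchScanner bs = conn.createBatchScanner("tableA", Authorizations.EMPTY, 16);
>     bs.setRanges(Collections.singleton(new Range()));
>     bs.addScanIterator(new IteratorSetting(25, RemoteReadIterator.class));
>     for (Entry<Key,Value> unused : bs) {
>       // Against a single tablet server this loop never makes progress: the
>       // outer scan exhausts the readahead pool and the inner Scanner Bs block.
>     }
>     bs.close();
>   }
> }
> {code}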
> We can imagine a few kinds of solutions, such as fixing the self-increasing thread pool mechanism that does not appear to work, or making re-entrant thread pools.  Let's find a simple solution.  If I had my druthers, I would create a mechanism for an Accumulo iterator to read from other tables in the same instance without having to open up a Scanner, which is an improvement beyond the scope of this ticket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
