accumulo-notifications mailing list archives

From "Dylan Hutchison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3975) Deadlock by recursive scans
Date Thu, 27 Aug 2015 03:14:45 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716000#comment-14716000 ]

Dylan Hutchison commented on ACCUMULO-3975:
-------------------------------------------

Righty, a nice new feature would be a mechanism for scans to open scans.  It would require good design to avoid excessive complexity.  If one of the target tablets of the scan-within-a-scan is on a different tablet server, then a Thrift RPC could occur as normal.  For target tablets on the same tablet server, it would be really nice to use a local backdoor rather than going through a new Thrift serialize-transmit-deserialize path from one port to another.  I originally thought this was what {{IteratorEnvironment.reserveMapFileReader()}} was for, but it seems to do something different.
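
To make the "local backdoor" concrete, here is a rough sketch of the shape it could take.  Everything below is hypothetical: no such interface or {{openScanner}} method exists on {{IteratorEnvironment}} today.

{code:java}
import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

// Hypothetical sketch of a "local backdoor" for scans that open scans.
public interface RecursiveIteratorEnvironment extends IteratorEnvironment {
  /**
   * Opens a source over another table in the same instance. For tablets
   * hosted by this tablet server, an implementation could wire into the
   * local iterator stacks directly; for remote tablets it would fall
   * back to an ordinary Thrift RPC.
   */
  SortedKeyValueIterator<Key,Value> openScanner(String tableName, Range range)
      throws IOException;
}
{code}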

Anyway, it would be nice to do something about the fixed thread pool size for readaheads regardless of whether we build a new feature.  Do you know why it is fixed, Josh?
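
In the meantime, raising the pool size only postpones the problem, but it may help some users.  Assuming the property can be changed like most tserver settings (it may need a tserver restart to take effect):

{noformat}
root@myinstance> config -s tserver.readahead.concurrent.max=64
{noformat}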

> Deadlock by recursive scans
> ---------------------------
>
>                 Key: ACCUMULO-3975
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3975
>             Project: Accumulo
>          Issue Type: Bug
>          Components: mini, tserver
>    Affects Versions: 1.7.0
>            Reporter: Dylan Hutchison
>
> A tablet server has a fixed-size thread pool that it uses for scanning.  The maximum number of threads is controlled by {{tserver.readahead.concurrent.max}}, which defaults to 16.
> Take the use case of opening a Scanner inside a server-side iterator.  The following sequence results in deadlock (a client-side sketch follows the list):
> 1. A client creates a BatchScanner (call it A) with enough query threads (say, 16) that it uses up all the readahead threads on a single tablet server.
> 2. Inside the scan on that unlucky tablet server, an iterator opens Scanners (call these B) to tablets on the same tablet server.
> 3. The B Scanners inside the iterators block because there is no free readahead thread on the target tablet server to serve the request, and they never unblock.  Essentially, the tserver scan threads block trying to obtain tserver scan threads from the same thread pool.
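> A minimal client-side sketch of step 1 (assuming a hypothetical pre-split table {{tableA}} with an iterator attached that opens Scanners server-side, per step 2):
> {code:java}
> import java.util.Collections;
> import java.util.Map;
>
> import org.apache.accumulo.core.client.BatchScanner;
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Range;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.security.Authorizations;
>
> public class DeadlockSketch {
>   // "tableA" and its server-side iterator are hypothetical; connector
>   // setup is elided.
>   static void reproduce(Connector connector) throws Exception {
>     // 17 query threads > tserver.readahead.concurrent.max (16)
>     BatchScanner bs =
>         connector.createBatchScanner("tableA", Authorizations.EMPTY, 17);
>     bs.setRanges(Collections.singleton(new Range()));  // whole table
>     for (Map.Entry<Key,Value> entry : bs) {
>       // never completes: the server-side Scanners (B) wait for readahead
>       // threads that this BatchScanner (A) is already holding
>     }
>     bs.close();
>   }
> }
> {code}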
> The tablet server does not seem to recover from this event even after the client disconnects (e.g., by killing the client).  Not all of the internalRead threads appear to die with an IOException, which can prevent subsequent scans over smaller numbers of tablets from succeeding.  It does recover when the tablet server is restarted.
> The tablet server has some mechanism to increase the thread pool size, in {{rpc.TServerUtils.createSelfResizingThreadPool}}, but it seems to be ineffective.  I see log messages like these:
> {noformat}
> 2015-08-26 21:35:24,247 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:25,248 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:26,250 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:27,252 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> {noformat}
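> The repeated "to 33" messages suggest the resize never takes effect.  One plausible explanation, sketched below (this is a guess at the failure mode, not the actual {{TServerUtils}} code): with an unbounded work queue, a {{ThreadPoolExecutor}} never creates threads beyond its core size, so raising only the maximum size logs a bigger number without ever adding threads.
> {code:java}
> import java.util.concurrent.LinkedBlockingQueue;
> import java.util.concurrent.ThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
>
> public class PoolResizeSketch {
>   public static void main(String[] args) {
>     ThreadPoolExecutor pool = new ThreadPoolExecutor(
>         16, 16, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
>     pool.setMaximumPoolSize(33);  // logged as "Increasing ... to 33"
>     // Without also calling pool.setCorePoolSize(33), the pool never grows:
>     // with an unbounded queue, threads beyond corePoolSize are not created.
>     System.out.println(pool.getCorePoolSize());  // still 16
>     pool.shutdown();
>   }
> }
> {code}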
> A bunch of these also pop up, in case it helps:
> {noformat}
> 2015-08-26 21:38:29,417 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:34,428 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1]
> 2015-08-26 21:38:39,433 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:44,266 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:38802 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 2015-08-26 21:38:44,438 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:48,022 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0 entries in 0.02 secs (lookup_time:0.02 secs tablets:1 ranges:1)
> 2015-08-26 21:38:48,034 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0 entries in 0.01 secs (lookup_time:0.01 secs tablets:1 ranges:1)
> 2015-08-26 21:38:49,452 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:54,456 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:59,473 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:39:04,484 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> {noformat}
> I pushed a [test case that reproduces the deadlock in the Graphulo test code|https://github.com/Accla/graphulo/blob/master/src/test/java/edu/mit/ll/graphulo/AccumuloBugTest.java#L47].  It shows that when we use fewer threads than {{tserver.readahead.concurrent.max}} (16), everything is okay, but when we use more threads, deadlock occurs quite reliably.
> We can imagine a few kinds of solutions, such as fixing the self-increasing thread pool mechanism that does not appear to work, or making thread pools re-entrant (sketched below).  Let's find a simple solution.  If I had my druthers, I would create a mechanism for an Accumulo iterator to read from other tables in the same instance without having to open a Scanner, but that improvement is beyond the scope of this ticket.
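> A minimal sketch of the re-entrant idea (hypothetical, not existing Accumulo code): if a task is submitted from a thread that already belongs to the pool, run it inline instead of queueing it, so a scan that opens another scan never waits on its own pool.  Running inline changes per-scan parallelism, so this only shows the shape of the fix.
> {code:java}
> import java.util.Collections;
> import java.util.Set;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
> import java.util.concurrent.FutureTask;
>
> public class ReentrantPool {
>   private final Set<Thread> workers =
>       Collections.newSetFromMap(new ConcurrentHashMap<Thread,Boolean>());
>   private final ExecutorService delegate = Executors.newFixedThreadPool(16);
>
>   public <T> Future<T> submit(final Callable<T> task) {
>     if (workers.contains(Thread.currentThread())) {
>       FutureTask<T> ft = new FutureTask<T>(task);
>       ft.run();  // already a worker: run inline rather than deadlock
>       return ft;
>     }
>     return delegate.submit(new Callable<T>() {
>       @Override
>       public T call() throws Exception {
>         workers.add(Thread.currentThread());
>         try {
>           return task.call();
>         } finally {
>           workers.remove(Thread.currentThread());
>         }
>       }
>     });
>   }
> }
> {code}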



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
