accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3975) Deadlock by recursive scans
Date Thu, 27 Aug 2015 15:22:46 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716837#comment-14716837
] 

Josh Elser commented on ACCUMULO-3975:
--------------------------------------

I think there are two things being discussed here. One is what Dylan sees as a bug: the pool of
readahead threads does not resize dynamically. The other is a new improvement that supports "scans
within scans" (or whatever you want to call it).

Thanks to John for also chiming in -- I think there is purpose behind keeping this pool fixed.
Perhaps we can do something to explain/understand this better, though, so it might be premature
to close this as "Won't Fix" right now. As for the latter improvement, I'll leave it to Dylan to
open another ticket if he wants to push that effort.
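
To make the fixed-pool hazard concrete, here is a minimal, self-contained sketch (plain java.util.concurrent, not Accumulo code) of the pattern the report describes: tasks in a fixed pool submit sub-tasks to the same pool and block waiting on them. The class and method names are illustrative only.

```java
import java.util.concurrent.*;

// Stand-in for the tserver readahead pool deadlock: "outer scans" occupy
// every pool thread, then each submits an "inner scan" to the same pool
// and blocks on it, so the inner tasks can never be scheduled.
public class PoolDeadlockDemo {

    // Returns true if the nested submissions deadlock (detected via timeout).
    static boolean deadlocks(int poolSize, int outerTasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Each outer task submits an inner task to the same pool and
            // waits on it, like an iterator opening a Scanner back to its
            // own tablet server.
            Callable<String> outer = () -> pool.submit(() -> "inner").get();

            Future<String> first = pool.submit(outer);
            for (int i = 1; i < outerTasks; i++) {
                pool.submit(outer);
            }
            first.get(2, TimeUnit.SECONDS);
            return false;                 // inner task ran; no deadlock
        } catch (TimeoutException e) {
            return true;                  // every thread is blocked in get()
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // 2 outer tasks saturate a 2-thread pool: deadlock.
        System.out.println("2 tasks / 2 threads deadlock: " + deadlocks(2, 2));
        // With spare threads the inner tasks can run: no deadlock.
        System.out.println("2 tasks / 4 threads deadlock: " + deadlocks(4, 2));
    }
}
```

This mirrors the report's step 3 exactly: the pool's worker threads are all waiting on results that can only be produced by a thread from that same pool.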

> Deadlock by recursive scans
> ---------------------------
>
>                 Key: ACCUMULO-3975
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3975
>             Project: Accumulo
>          Issue Type: Bug
>          Components: mini, tserver
>    Affects Versions: 1.7.0
>            Reporter: Dylan Hutchison
>             Fix For: 1.8.0
>
>
> A tablet server has a fixed size thread pool that it uses for scanning.  The maximum
number of threads is controlled by {{tserver.readahead.concurrent.max}}, which defaults to
16.
> Take the use case of opening a Scanner inside of a server-side iterator.  The following
results in deadlock.
> 1. A client creates a BatchScanner (call this A) with enough query threads (say, 16) to use
up all the readahead threads on a single tablet server.
> 2. Inside the scan on that unlucky tablet server, an iterator opens a Scanner (call these
B) to tablets on the same tablet server.
> 3. The Scanner Bs inside the iterators block because there is no free readahead thread
on the target tablet server to serve the request. They never unblock.  Essentially the tserver
scan threads block on trying to obtain tserver scan threads from the same thread pool.
> The tablet server does not seem to recover from this event even after the client disconnects
(e.g. by killing the client).  Not all the internalRead threads appear to die by IOException,
which can prevent subsequent scans with smaller numbers of tablets from succeeding.  It does
recover on restarting the tablet server.
> The tablet server has some mechanism to increase the thread pool size at {{rpc.TServerUtils.createSelfResizingThreadPool}}.
 It seems to be ineffective.  I see log messages like these:
> {noformat}
> 2015-08-26 21:35:24,247 [rpc.TServerUtils] INFO : Increasing server thread pool size
on TabletServer to 33 
> 2015-08-26 21:35:25,248 [rpc.TServerUtils] INFO : Increasing server thread pool size
on TabletServer to 33 
> 2015-08-26 21:35:26,250 [rpc.TServerUtils] INFO : Increasing server thread pool size
on TabletServer to 33 
> 2015-08-26 21:35:27,252 [rpc.TServerUtils] INFO : Increasing server thread pool size
on TabletServer to 33
> {noformat}
> Also, a bunch of these pop up, in case it helps:
> {noformat}
> 2015-08-26 21:38:29,417 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0
0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]  
> 2015-08-26 21:38:34,428 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0
0 entries in 0.00 secs, nbTimes = [0 0 0.00 1]  
> 2015-08-26 21:38:39,433 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0
0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]  
> 2015-08-26 21:38:44,266 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:38802 !0
0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]  
> 2015-08-26 21:38:44,438 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0
0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]  
> 2015-08-26 21:38:48,022 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0
entries in 0.02 secs (lookup_time:0.02 secs tablets:1 ranges:1)  
> 2015-08-26 21:38:48,034 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0
entries in 0.01 secs (lookup_time:0.01 secs tablets:1 ranges:1)  
> 2015-08-26 21:38:49,452 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0
0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]  
> 2015-08-26 21:38:54,456 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0
0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]  
> 2015-08-26 21:38:59,473 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0
0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]  
> 2015-08-26 21:39:04,484 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0
0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]  
> {noformat}
> I pushed a [test case that reproduces the deadlock in the Graphulo test code|https://github.com/Accla/graphulo/blob/master/src/test/java/edu/mit/ll/graphulo/AccumuloBugTest.java#L47].
 It shows that when we use fewer threads than {{tserver.readahead.concurrent.max}} (16), everything
is okay, but when we use more threads, deadlock occurs quite reliably.
> We can imagine a few kinds of solutions, such as fixing the self-increasing thread pool
mechanism, which does not appear to work, or making the thread pools re-entrant.  Let's find a simple
solution.  If I had my druthers, I would create a mechanism for an Accumulo iterator to read
from other tables in the same instance without having to open a Scanner, but that improvement
is beyond the scope of this ticket.
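
One of the solution shapes mentioned above, a "re-entrant" thread pool, could hypothetically look like the following sketch (plain Java, not Accumulo code; `ReentrantPool` and `insideWorker` are invented names): nested submissions made from inside a worker thread run inline instead of queuing, so they can never starve waiting for a free thread.

```java
import java.util.concurrent.*;

// Hypothetical sketch of a re-entrant pool: a task submitted from within a
// worker thread executes inline on that thread rather than queuing behind
// it, which removes the self-deadlock described in this ticket.
public class ReentrantPool {
    private final ExecutorService pool;
    // Marks whether the current thread is already one of this pool's workers.
    private final ThreadLocal<Boolean> insideWorker = ThreadLocal.withInitial(() -> false);

    public ReentrantPool(int size) {
        this.pool = Executors.newFixedThreadPool(size);
    }

    public <T> Future<T> submit(Callable<T> task) {
        if (insideWorker.get()) {
            // Nested submission from a worker: run inline instead of queuing.
            try {
                return CompletableFuture.completedFuture(task.call());
            } catch (Exception e) {
                CompletableFuture<T> f = new CompletableFuture<>();
                f.completeExceptionally(e);
                return f;
            }
        }
        return pool.submit(() -> {
            insideWorker.set(true);
            try {
                return task.call();
            } finally {
                insideWorker.set(false);
            }
        });
    }

    public void shutdown() { pool.shutdownNow(); }

    public static void main(String[] args) throws Exception {
        // Even a single-thread pool cannot self-deadlock on nested submits.
        ReentrantPool p = new ReentrantPool(1);
        Future<String> outer = p.submit(() -> p.submit(() -> "inner").get());
        System.out.println(outer.get(2, TimeUnit.SECONDS));
        p.shutdown();
    }
}
```

The trade-off, consistent with the point about keeping the pool fixed, is that inline execution lets a nested scan bypass the pool's concurrency cap, so the limit set by {{tserver.readahead.concurrent.max}} would no longer be a hard bound.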



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
