accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ShawnWalker <...@git.apache.org>
Subject [GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...
Date Fri, 15 Apr 2016 16:41:57 GMT
Github user ShawnWalker commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210540256
  
    > ... the "lower" half of the span would be un-rooted (the part doing binning and sending).
    Is "unrooted" different from "unreported"?  My view of the tracing message suggests so,
but I'm uncertain as to exactly the difference.
    
    > Have you been able to notice your changes positively affecting ShellServerIT#trace's
success rate? I have seen it fail now and again, but never reliably.
    
    For me, `ShellServerIT.trace()` _always_ fails.  I've been banging my head against this
for the past 3 days, trying to get tests to succeed while working on ACCUMULO-4187.  I'll
take your word that it doesn't fail for everyone all the time.  With this change, `ShellServerIT`
has succeeded for me the 5 or 6 times I've run it.
    
    > Any idea as to why we only sometimes see ShellServerIT fail? Is there a reason why
this doesn't always fail that you noticed?
    
    My current theory is that it's a race condition depending on how long it takes to spin
up a new threadpool:
    
    At line `TabletServerBatchWriter` line 670, `binningThreadPool` uses a `SynchronousQueue`,
and at line 671, the `binningThreadPool` has a `CallerRunsPolicy`.  Together these mean "perform
binning in the other thread, but only if it's not currently busy doing binning; otherwise
just do the binning in the current thread."
    
    If the binning is done in the current thread, there is no need to propagate tracing information;
as you point out, this information is carried by thread-local variables.  If the binning ends
up being done in the thread pool, the tracing information needs to be propagated.
    
    But there's a race condition here: The `SynchronousQueue` will only allow a job to be
pushed into it if there's already a thread sitting waiting on a job.  So the race ends up
being which happens first: the consumer thread created to service `binningThreadPool` is ready
to accept work, or the producer thread is ready to commit some of its mutations.  In the former
case, the work gets done in the thread pool; in the latter case, the work gets done in the
current thread.
    
    > Any thoughts on how we could make a test which would specifically check for regressions
here?
    
    Race conditions are notably difficult to diagnose.  What's more, this particular race
condition isn't an error (assuming the tracing is properly handled regardless of who wins
the race).  I can think of a few ways one might better expose this particular bug under lab
conditions, but any such test would by necessity be timing sensitive, and I do so hate knowingly
writing timing-sensitive tests.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message