accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-4191) Tracing on client can sometimes lose "sendMutations" events.
Date Fri, 15 Apr 2016 16:42:25 GMT


ASF GitHub Bot commented on ACCUMULO-4191:

Github user ShawnWalker commented on the pull request:
    > ... the "lower" half of the span would be un-rooted (the part doing binning and sending).
    Is "unrooted" different from "unreported"?  My view of the tracing message suggests so,
but I'm uncertain as to exactly the difference.
    > Have you been able to notice your changes positively affecting ShellServerIT#trace's
success rate? I have seen it fail now and again, but never reliably.
    For me, `ShellServerIT.trace()` _always_ fails.  I've been banging my head against this
for the past 3 days, trying to get tests to succeed while working on ACCUMULO-4187.  I'll
take your word that it doesn't fail for everyone all the time.  With this change, `ShellServerIT`
has succeeded for me the 5 or 6 times I've run it.
    > Any idea as to why we only sometimes see ShellServerIT fail? Is there a reason why
this doesn't always fail that you noticed?
    My current theory is that it's a race condition depending on how long it takes to spin
up a new threadpool:
    At line `TabletServerBatchWriter` line 670, `binningThreadPool` uses a `SynchronousQueue`,
and at line 671, the `binningThreadPool` has a `CallerRunsPolicy`.  Together these mean "perform
binning in the other thread, but only if it's not currently busy doing binning; otherwise
just do the binning in the current thread."
    If the binning is done in the current thread, there is no need to propagate tracing information;
as you point out, this information is carried by thread-local variables.  If the binning ends
up being done in the thread pool, the tracing information needs to be propagated.
    But there's a race condition here: The `SynchronousQueue` will only allow a job to be
pushed into it if there's already a thread sitting waiting on a job.  So the race ends up
being which happens first: the consumer thread created to service `binningThreadPool` is ready
to accept work, or the producer thread is ready to commit some of its mutations.  In the former
case, the work gets done in the thread pool; in the latter case, the work gets done in the
current thread.
    > Any thoughts on how we could make a test which would specifically check for regressions
    Race conditions are notably difficult to diagnose.  What's more, this particular race
condition isn't an error (assuming the tracing is properly handled regardless of who wins
the race).  I can think of a few ways one might better expose this particular bug under lab
conditions, but any such test would by necessity be timing sensitive, and I do so hate knowingly
writing timing-sensitive tests.

> Tracing on client can sometimes lose "sendMutations" events.
> ------------------------------------------------------------
>                 Key: ACCUMULO-4191
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Shawn Walker
>            Assignee: Shawn Walker
>            Priority: Minor
> In {{TabletServerBatchWriter.MutationWriter.queueMutations(...)}} (line 736), HTrace
{{Span}}'s are not properly propagated across thread boundaries.  Consequently, tracing doesn't
propagate into {{TabletServerBatchWriter.SendTask.send(...)}}, and so the {{sendMutation}}
event can fail to be logged.
> This can cause {{ShellServerIT.trace(...)}} to fail.

This message was sent by Atlassian JIRA

View raw message