thrift-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Buğra Gedik (JIRA) <j...@apache.org>
Subject [jira] [Commented] (THRIFT-3932) C++ ThreadManager has a rare termination race
Date Mon, 04 Sep 2017 21:41:02 GMT

    [ https://issues.apache.org/jira/browse/THRIFT-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152958#comment-16152958
] 

Buğra Gedik commented on THRIFT-3932:
-------------------------------------

Hi [~jking3]. Do you know if there are any release plans that would contain this fix? Is there
a publicly available resource that talks about release plans? Thanks!

> C++ ThreadManager has a rare termination race
> ---------------------------------------------
>
>                 Key: THRIFT-3932
>                 URL: https://issues.apache.org/jira/browse/THRIFT-3932
>             Project: Thrift
>          Issue Type: Bug
>          Components: C++ - Library
>            Reporter: Buğra Gedik
>            Assignee: James E. King, III
>             Fix For: 0.10.0
>
>         Attachments: thrift-patch
>
>          Time Spent: 17h
>  Remaining Estimate: 0h
>
> {{ThreadManger::join}} calls {{stopImpl(true)}}, which in turn calls {{removeWorker(workerCount_);}}.
The latter waits until {{while (workerCount_ != workerMaxCount_)}}. Within the {{run}} method
of the workers, the last thread that detects {{workerCount_ == workerMaxCount_}} notifies
{{removeWorker}}. The {{run}} method has the following additional code that is executed at
the very end:
> {code}
>     {
>       Synchronized s(manager_->workerMonitor_);
>       manager_->deadWorkers_.insert(this->thread());
>       if (notifyManager) {
>         manager_->workerMonitor_.notify();
>       }
>     }
> {code}
> This is an independent synchronized block. Now assume 2 threads. One of them has {{notifyManager=true}}
as it detected the {{workerCount_ == workerMaxCount_}} condition earlier. It is possible that
this thread gets to execute  the above code block first, {{ThreadManager}}'s {{removeWorker}}
method unblocks, and eventually {{ThreadManager}}'s {{join}} returns and the object is destructed.
When the other thread reaches the synchronized block above, it will crash, as the manager
is not around anymore.
> Besides, {{ThreadManager}} never joins its threads.
> Attached is a patch.
> {panel:title=Resolution Details|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
> A fair number of things were improved with this pull request:
> # There are three monitors used for signaling; two of them use the same mutex and then
the workerMonitor is using its own. I made all three use the same mutex.
> # I replaced all calls to Synchronize with Guard on the mutex_ so it is a bit clearer
that there is one lock protecting everything.
> # The worker run loop expires tasks as it encounters them. Now we only call removeExpiredTasks
if there's no room during add.
> # I found that when worker threads are removed they are not joined; if a thread is not
detached we now join those threads.
> # I added better documentation and simplified the Worker::run() a little so it is easier
to understand, but the logic is the same. idle_ was not used and removed.
> # I found that remove(Task) simply wasn't implemented at all! so I added code for that
and tests.
> # Release mode tests of thread manager would not fail on error because they use assert()
to verify results! If we only do release builds in CI, these tests may have been silently
failing for a long time.
> # ThreadManager::join() is unnecessary, as stop() will join worker threads (as will removeWorker)
depending on the ThreadFactory's detached disposition.
> # Cleaned up some technical debt from the TServerFramework refactoring:
> # Cleaned up some things in the different platform thread factories. Some abstractions
were unnecessary.
> # Reduced overall time on concurrency test to about 30 seconds on my dev system while
increasing the reliability and improving test coverage significantly. I ran it 100 times in
a loop.
> # Added tests for most of the ThreadManager APIs.
> # Tested with posix, std, and boost threads on linux.
> # Tested on Windows.
> # Fixed a math error in time calculation that was present on Windows (with VS2010) which
made expiring tasks impossible (unit test proved the issue and the fix).
> # Re-enable unit tests in the appveyor build that are no longer unstable.
> {panel}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message