Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 5545A200CEF for ; Mon, 4 Sep 2017 23:41:18 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 5231E16353E; Mon, 4 Sep 2017 21:41:18 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 9609B16353B for ; Mon, 4 Sep 2017 23:41:17 +0200 (CEST) Received: (qmail 54237 invoked by uid 500); 4 Sep 2017 21:41:15 -0000 Mailing-List: contact dev-help@thrift.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@thrift.apache.org Delivered-To: mailing list dev@thrift.apache.org Received: (qmail 54226 invoked by uid 99); 4 Sep 2017 21:41:15 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 04 Sep 2017 21:41:15 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 2B2B1C1B70 for ; Mon, 4 Sep 2017 21:41:15 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -100.002 X-Spam-Level: X-Spam-Status: No, score=-100.002 tagged_above=-999 required=6.31 tests=[RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id 5ZfdTP-0npRg for ; Mon, 4 Sep 2017 21:41:10 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id 80B5A60D2C for ; Mon, 4 Sep 2017 21:41:09 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 3F72EE0EEC for ; Mon, 4 Sep 2017 21:41:06 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id C3BE524158 for ; Mon, 4 Sep 2017 21:41:02 +0000 (UTC) Date: Mon, 4 Sep 2017 21:41:02 +0000 (UTC) From: =?utf-8?Q?Bu=C4=9Fra_Gedik_=28JIRA=29?= To: dev@thrift.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (THRIFT-3932) C++ ThreadManager has a rare termination race MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Mon, 04 Sep 2017 21:41:18 -0000 [ https://issues.apache.org/jira/browse/THRIFT-3932?page=3Dcom.atlassia= n.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D161= 52958#comment-16152958 ]=20 Bu=C4=9Fra Gedik commented on THRIFT-3932: ------------------------------------- Hi [~jking3]. Do you know if there are any release plans that would contain= this fix? Is there a publicly available resource that talks about release = plans? Thanks! > C++ ThreadManager has a rare termination race > --------------------------------------------- > > Key: THRIFT-3932 > URL: https://issues.apache.org/jira/browse/THRIFT-3932 > Project: Thrift > Issue Type: Bug > Components: C++ - Library > Reporter: Bu=C4=9Fra Gedik > Assignee: James E. King, III > Fix For: 0.10.0 > > Attachments: thrift-patch > > Time Spent: 17h > Remaining Estimate: 0h > > {{ThreadManger::join}} calls {{stopImpl(true)}}, which in turn calls {{re= moveWorker(workerCount_);}}. The latter waits until {{while (workerCount_ != =3D workerMaxCount_)}}. Within the {{run}} method of the workers, the last = thread that detects {{workerCount_ =3D=3D workerMaxCount_}} notifies {{remo= veWorker}}. The {{run}} method has the following additional code that is ex= ecuted at the very end: > {code} > { > Synchronized s(manager_->workerMonitor_); > manager_->deadWorkers_.insert(this->thread()); > if (notifyManager) { > manager_->workerMonitor_.notify(); > } > } > {code} > This is an independent synchronized block. Now assume 2 threads. One of t= hem has {{notifyManager=3Dtrue}} as it detected the {{workerCount_ =3D=3D w= orkerMaxCount_}} condition earlier. It is possible that this thread gets to= execute the above code block first, {{ThreadManager}}'s {{removeWorker}} = method unblocks, and eventually {{ThreadManager}}'s {{join}} returns and th= e object is destructed. When the other thread reaches the synchronized bloc= k above, it will crash, as the manager is not around anymore. > Besides, {{ThreadManager}} never joins its threads. > Attached is a patch. > {panel:title=3DResolution Details|borderStyle=3Ddashed|borderColor=3D#ccc= |titleBGColor=3D#F7D6C1|bgColor=3D#FFFFCE} > A fair number of things were improved with this pull request: > # There are three monitors used for signaling; two of them use the same m= utex and then the workerMonitor is using its own. I made all three use the = same mutex. > # I replaced all calls to Synchronize with Guard on the mutex_ so it is a= bit clearer that there is one lock protecting everything. > # The worker run loop expires tasks as it encounters them. Now we only ca= ll removeExpiredTasks if there's no room during add. > # I found that when worker threads are removed they are not joined; if a = thread is not detached we now join those threads. > # I added better documentation and simplified the Worker::run() a little = so it is easier to understand, but the logic is the same. idle_ was not use= d and removed. > # I found that remove(Task) simply wasn't implemented at all! so I added = code for that and tests. > # Release mode tests of thread manager would not fail on error because th= ey use assert() to verify results! If we only do release builds in CI, thes= e tests may have been silently failing for a long time. > # ThreadManager::join() is unnecessary, as stop() will join worker thread= s (as will removeWorker) depending on the ThreadFactory's detached disposit= ion. > # Cleaned up some technical debt from the TServerFramework refactoring: > # Cleaned up some things in the different platform thread factories. Some= abstractions were unnecessary. > # Reduced overall time on concurrency test to about 30 seconds on my dev = system while increasing the reliability and improving test coverage signifi= cantly. I ran it 100 times in a loop. > # Added tests for most of the ThreadManager APIs. > # Tested with posix, std, and boost threads on linux. > # Tested on Windows. > # Fixed a math error in time calculation that was present on Windows (wit= h VS2010) which made expiring tasks impossible (unit test proved the issue = and the fix). > # Re-enable unit tests in the appveyor build that are no longer unstable. > {panel} -- This message was sent by Atlassian JIRA (v6.4.14#64029)