From: Patricia Shanahan <pats@acm.org>
Date: Thu, 05 Aug 2010 14:41:41 -0700
To: river-dev@incubator.apache.org
Subject: Re: TaskManager progress

addProxyReg is called from the LookupCacheImpl constructor and from the
discover method of a DiscoveryListener attached to a
LookupDiscoveryManager. I don't see any specific rules about thread
choice for the discover call, but as far as I can tell the lookup
discovery manager knows nothing about the service discovery manager's
cache, and so is unlikely to use its TaskManager.

A LookupTask is created in the run method of a RegisterListenerTask,
which is the class of task that addProxyReg adds to the cacheTaskMgr
TaskManager. RegisterListenerTask inherits the CacheTask runAfter
method, which simply returns false, so there is nothing to stop it from
running in parallel with any older task.
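To make the ordering contract concrete, here is a rough sketch of how
runAfter gates execution. This is simplified, not the actual River
source: TaskManager.Task.runAfter(List tasks, int size) is the real
hook, but the class shapes, field names, and the instanceof test below
are illustrative only.

    import java.util.List;

    // Stand-in for a TaskManager.Task; simplified from the River sources.
    abstract class CacheTask implements Runnable {
        protected final long seqN; // assigned while holding the serviceIdMap lock

        CacheTask(long seqN) { this.seqN = seqN; }

        // Default: no ordering constraint. TaskManager consults runAfter
        // before dispatching; returning false means this task may run in
        // parallel with any older task in the queue. RegisterListenerTask
        // inherits exactly this.
        public boolean runAfter(List tasks, int size) {
            return false;
        }
    }

    // By contrast, LookupTask/NotifyEventTask override runAfter so they
    // wait for earlier tasks with lower sequence numbers.
    class OrderedCacheTaskSketch extends CacheTask {
        OrderedCacheTaskSketch(long seqN) { super(seqN); }

        @Override
        public boolean runAfter(List tasks, int size) {
            for (int i = 0; i < size; i++) {
                Object t = tasks.get(i);
                if (t instanceof CacheTask && ((CacheTask) t).seqN < seqN) {
                    return true; // defer until the earlier task has completed
                }
            }
            return false;
        }

        public void run() { /* cache maintenance work elided */ }
    }

The real overrides also test the concrete task type rather than any
CacheTask, but the shape is the same: the default says "run me any
time," while the overrides say "not before earlier registrations."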
Looks like two or more different threads to me.

I think a simpler explanation for the lack of problem reports lies in
two effects:

1. Almost all the time, addProxyReg will win the race. It has a
significant running start: immediately after the end of its
synchronized block, it drops straight into calling the TaskManager add
method. Meanwhile, another thread that was at the start of its own
synchronized block would have to go through creating a new Task object
before attempting the add call. For addProxyReg to lose the race for
the add method's TaskManager synchronization would require a cache miss
or an interrupt in a window a few instructions long.

2. The symptom, if any, would be cache confusion that would be very
hard to distinguish from a variation on
https://issues.apache.org/jira/browse/RIVER-324. If there were any
observations of this problem before RIVER-324 was fixed, they would
have been conflated with it and presumed fixed.

I am a strong believer in closing even the tiniest timing windows,
because they can collectively lead to general flakiness even if there
are no reproducible bug reports.

Patricia

On 8/5/2010 6:13 AM, Gregg Wonderly wrote:
> I haven't looked yet, but a quick thought I had was: are the other
> dependent instances (higher sequence numbers) actually created on a
> separate thread, so that ordering could be compromised, or are they
> just created later in the same thread's execution path?
>
> Gregg Wonderly
>
> Peter Firmstone wrote:
>> Thanks Patricia, sharp eyes!
>>
>> Cheers,
>>
>> Peter.
>>
>> Patricia Shanahan wrote: ...
>>> ServiceDiscoveryManager.LookupCacheImpl.taskSeqN is used to
>>> initialize sequence number fields in some CacheTask subclasses that
>>> are then used in runAfter methods.
>>>
>>> ServiceDiscoveryManager.LookupCacheImpl.addProxyReg creates a task
>>> with an incremented taskSeqN inside a serviceIdMap synchronized
>>> block, but adds it to the cacheTaskMgr outside the block.
>>>
>>> public void addProxyReg(ProxyReg reg) {
>>>     RegisterListenerTask treg;
>>>     synchronized(serviceIdMap) {
>>>         treg = new RegisterListenerTask(reg, taskSeqN++);
>>>     }//end sync(serviceIdMap)
>>>     cacheTaskMgr.add(treg);
>>> }//end LookupCacheImpl.addProxyReg
>>>
>>> In the remaining sequence-numbered CacheTask subclass cases, the
>>> cacheTaskMgr.add call is done inside the same
>>> synchronized(serviceIdMap) block as the taskSeqN increment.
>>>
>>> There is a window in addProxyReg during which a LookupTask or
>>> NotifyEventTask with a higher sequence number could be added before
>>> the RegisterListenerTask that addProxyReg is adding. The LookupTask
>>> and NotifyEventTask runAfter methods both wait for any
>>> RegisterListenerTask with a lower sequence number.
>>>
>>> I believe this is a bug, but it will be hard to construct a test
>>> case because it involves such a narrow timing window. I do recommend
>>> moving the cacheTaskMgr.add(treg) statement inside the synchronized
>>> block.
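For concreteness, a minimal sketch of the change recommended at the end
of the quoted message, matching the pattern the other sequence-numbered
cases already follow (surrounding class context elided):

    public void addProxyReg(ProxyReg reg) {
        synchronized(serviceIdMap) {
            RegisterListenerTask treg =
                new RegisterListenerTask(reg, taskSeqN++);
            // Adding while still holding the lock closes the window: no
            // other thread can take a higher sequence number and reach
            // cacheTaskMgr.add first.
            cacheTaskMgr.add(treg);
        }//end sync(serviceIdMap)
    }//end LookupCacheImpl.addProxyReg

This keeps sequence-number assignment and queue insertion atomic with
respect to serviceIdMap, which is exactly the invariant the runAfter
ordering depends on.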