From: Gregg Wonderly
Date: Fri, 06 Aug 2010 09:47:47 -0500
To: river-dev@incubator.apache.org
Subject: Re: TaskManager progress
Message-ID: <4C5C2093.9010902@wonderly.org>
In-Reply-To: <4C5B3015.8020509@acm.org>

Okay, it looks like there are several things that we need to look at in the
SDM code, then (*). I think we need to look carefully at the data mutations
that are occurring, and that RIVER-324 tries to help manage, to see whether we
can get our heads around them and, hopefully, convince ourselves that
RIVER-324's problem of stale or wrong cache contents can be managed more
readily.

Gregg Wonderly

(*) I had a hard time getting SDM to work reliably for me. In particular, it
initially did not perform unicast lookups but waited for multicast
announcements only, which made it non-functional on non-local network
segments.

Patricia Shanahan wrote:
> addProxyReg is called from the LookupCacheImpl constructor and from the
> discover method of a DiscoveryListener attached to a
> LookupDiscoveryManager. I don't see any specific rules about thread
> choice for the discover call, but as far as I can tell the lookup
> discovery manager does not know anything about the service discovery
> manager's cache, and so is unlikely to use its TaskManager.
>
> A LookupTask is created in the run method of a RegisterListenerTask,
> which is the class of task that addProxyReg adds to the cacheTaskMgr
> TaskManager. RegisterListenerTask inherits the CacheTask runAfter method
> that just returns false, so there is nothing to stop it from running in
> parallel with any older task.
>
> Looks like two or more different threads to me. I think a simpler
> explanation for the lack of problem reports lies in two effects:
>
> 1. Almost all the time, addProxyReg will win the race. It has a
> significant running start. Immediately after the end of its synchronized
> block, it drops straight into calling the TaskManager add method.
> Meanwhile, another thread that was at the start of its synchronized
> block would have to go through creating a new Task object before
> attempting the add call. For addProxyReg to lose the race for the add
> method's TaskManager synchronization would require a cache miss or an
> interrupt in a window a few instructions long.
>
> 2. The symptom, if any, would be cache confusion that would be very hard
> to distinguish from a variation on
> https://issues.apache.org/jira/browse/RIVER-324. If there were any
> observations of this problem before RIVER-324 was fixed, they would have
> been conflated with it and presumed fixed.
>
> I am a strong believer in closing even the tiniest timing windows,
> because they can collectively lead to general flakiness even if there
> are no reproducible bug reports.
>
> Patricia
>
> On 8/5/2010 6:13 AM, Gregg Wonderly wrote:
>> I haven't looked yet, but a quick thought I had was, are the other
>> dependent instances (higher sequence numbers) actually created on a
>> separate thread, so that ordering could be compromised, or are they just
>> created in later in the same threads execution path?
>>
>> Gregg Wonderly
>>
>> Peter Firmstone wrote:
>>> Thanks Patricia, sharp eyes!
>>>
>>> Cheers,
>>>
>>> Peter.
>>>
>>> Patricia Shanahan wrote:
> ...
>>>> ServiceDiscoveryManager.LookupCacheImpl.taskSeqN is used to
>>>> initialize sequence number fields in some CacheTask subclasses that
>>>> are then used in runAfter methods.
>>>>
>>>> ServiceDiscoveryManager.LookupCacheImpl.addProxyReg creates a task
>>>> with incremented taskSeqN inside a serviceIdMap synchronized block,
>>>> but adds it to the cacheTaskMgr outside the block.
>>>>
>>>> public void addProxyReg(ProxyReg reg) {
>>>>     RegisterListenerTask treg;
>>>>     synchronized(serviceIdMap) {
>>>>         treg = new RegisterListenerTask(reg, taskSeqN++);
>>>>     }//end sync(serviceIdMap)
>>>>     cacheTaskMgr.add(treg);
>>>> }//end LookupCacheImpl.addProxyReg
>>>>
>>>> In the remaining sequence numbered CacheTask subclass cases, the
>>>> cacheTaskMgr.add call is done inside the same
>>>> synchronized(serviceIdMap) block as the taskSeqN increment.
>>>>
>>>> There is a window in addProxyReg during which a LookupTask or
>>>> NotifyEventTask with a higher sequence number could be added before
>>>> the RegisterListenerTask task addProxyReg is adding. The LookupTask
>>>> and NotifyEventTask runAfter methods both wait for any
>>>> RegisterEventTask with a lower sequence number.
>>>>
>>>> I believe this is a bug, but it will be hard to construct a test case
>>>> because it involves such a narrow timing window. I do recommend
>>>> moving the cacheTaskMgr.add(treg) statement inside the synchronized
>>>> block.
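
For anyone following along, the fix recommended in the quoted message (doing the add inside the same synchronized block as the taskSeqN increment) can be sketched in isolation. This is only a minimal illustration under stated assumptions, not the River code: the class SeqOrderSketch, its Task type, the addTask and isQueueInSeqOrder methods, and the plain list standing in for the TaskManager are all hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the cache: NOT the River ServiceDiscoveryManager.
public class SeqOrderSketch {

    // Hypothetical task carrying only its sequence number.
    public static final class Task {
        public final long seq;
        Task(long seq) { this.seq = seq; }
    }

    private final Object lock = new Object();          // plays the role of serviceIdMap
    private long taskSeqN = 0;                         // guarded by lock
    private final List<Task> queue = new ArrayList<>(); // stand-in for the task manager

    // Number the task AND enqueue it while holding the same lock, so no
    // task with a higher sequence number can be enqueued ahead of it.
    public Task addTask() {
        synchronized (lock) {
            Task t = new Task(taskSeqN++);
            queue.add(t);
            return t;
        }
    }

    // True if the queue holds tasks in strictly increasing sequence order.
    public boolean isQueueInSeqOrder() {
        synchronized (lock) {
            for (int i = 1; i < queue.size(); i++) {
                if (queue.get(i - 1).seq >= queue.get(i).seq) {
                    return false;
                }
            }
            return true;
        }
    }

    public int size() {
        synchronized (lock) {
            return queue.size();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SeqOrderSketch s = new SeqOrderSketch();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    s.addTask();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("tasks: " + s.size() + ", in order: " + s.isQueueInSeqOrder());
    }
}
```

Because the sequence number is assigned and the task is enqueued under one lock, queue order always matches sequence order, however the threads interleave. If the add were moved outside the synchronized block, as in the quoted addProxyReg, a thread could be preempted between numbering and enqueueing and the ordering check could occasionally fail.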