river-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gregg Wonderly <gr...@wonderly.org>
Subject Re: TaskManager progress
Date Fri, 06 Aug 2010 14:47:47 GMT
Okay, it looks like there are several things that we need to look at with the 
SDM code then(*).  I think we need to look carefully at the mutations of the 
data that are occurring and that River-324 tries to help manage to see if we can 
get our heads around it and hopefully make sure that we believe that River-324's 
problem of stale or wrong cache contents can be managed more readily.

Gregg Wonderly

* I had a hard time getting SDM to work reliably for me, and in particular, it 
initially did not perform unicast lookups, but waited for multi-cast 
announcements only and this made it non-functional on non-local network segments.

Patricia Shanahan wrote:
> addProxyReg is called from the LookupCacheImpl constructor and from the 
> discover method of a DiscoveryListener attached to a 
> LookupDiscoveryManager. I don't see any specific rules about thread 
> choice for the discover call, but as far as I can tell the lookup 
> discovery manager does not know anything about the service discovery 
> manager's cache, and so is unlikely to use its TaskManager.
> A LookupTask is created in the run method of a RegisterListenerTask, 
> which is the class of task that addProxyReg adds to the cacheTaskMgr 
> TaskManager. RegisterListenerTask inherits the CacheTask runAfter method 
> that just returns false, so there is nothing to stop it from running in 
> parallel with any older task.
> Looks like two or more different threads to me. I think a simpler 
> explanation for the lack of problem reports lies in two effects:
> 1. Almost all the time, addProxyReg will win the race. It has a 
> significant running start. Immediately after the end of its synchronized 
> block, it drops straight into calling the TaskManager add method. 
> Meanwhile, another thread that was at the start of its synchronized 
> block would have to go through creating a new Task object before 
> attempting the add call. For addProxyReg to lose the race for the add 
> method's TaskManager synchronization would require a cache miss or an 
> interrupt in a window a few instructions long.
> 2. The symptom, if any, would be cache confusion that would be very hard 
> to distinguish from a variation on 
> https://issues.apache.org/jira/browse/RIVER-324. If there were any 
> observations of this problem before RIVER-324 was fixed, they would have 
> been conflated with it and presumed fixed.
> I am a strong believer in closing even the tiniest timing windows, 
> because they can collectively lead to general flakiness even if there 
> are no reproducible bug reports.
> Patricia
> On 8/5/2010 6:13 AM, Gregg Wonderly wrote:
>> I haven't looked yet, but a quick thought I had was, are the other
>> dependent instances (higher sequence numbers) actually created on a
>> separate thread, so that ordering could be compromised, or are they just
>> created in later in the same threads execution path?
>> Gregg Wonderly
>> Peter Firmstone wrote:
>>> Thanks Patricia, sharp eyes!
>>> Cheers,
>>> Peter.
>>> Patricia Shanahan wrote:
> ...
>>>> ServiceDiscoveryManager.LookupCacheImpl.taskSeqN is used to
>>>> initialize sequence number fields in some CacheTask subclasses that
>>>> are then used in runAfter methods.
>>>> ServiceDiscoveryManager.LookupCacheImpl.addProxyReg creates a task
>>>> with incremented taskSeqN inside a serviceIdMap synchronized block,
>>>> but adds it to the cacheTaskMgr outside the block.
>>>> public void addProxyReg(ProxyReg reg) {
>>>> RegisterListenerTask treg;
>>>> synchronized(serviceIdMap) {
>>>> treg = new RegisterListenerTask(reg, taskSeqN++);
>>>> }//end sync(serviceIdMap)
>>>> cacheTaskMgr.add(treg);
>>>> }//end LookupCacheImpl.addProxyReg
>>>> In the remaining sequence numbered CacheTask subclass cases, the
>>>> cacheTaskMgr.add call is done inside the same
>>>> synchronized(serviceIdMap) block as the taskSeqN increment.
>>>> There is a window in addProxyReg during which a LookupTask or
>>>> NotifyEventTask with a higher sequence number could be added before
>>>> the RegisterListenerTask task addProxyReg is adding. The LookupTask
>>>> and NotifyEventTask runAfter methods both wait for any
>>>> RegisterEventTask with a lower sequence number.
>>>> I believe this is a bug, but it will be hard to construct a test case
>>>> because it involves such a narrow timing window. I do recommend
>>>> moving the cacheTaskMgr.add(treg) statement inside the synchronized
>>>> block.

View raw message