river-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patricia Shanahan <p...@acm.org>
Subject Re: TaskManager progress
Date Fri, 06 Aug 2010 17:01:22 GMT
I agree. I probably need to understand both SDM and JoinManager, which 
has its own concurrency issues especially in the area of retries, before 
I can nail the requirements for TaskManager. I'll carry on studying SDM 
for now.

Patricia


On 8/6/2010 7:47 AM, Gregg Wonderly wrote:
> Okay, it looks like there are several things that we need to look at
> with the SDM code then(*). I think we need to look carefully at the
> mutations of the data that are occurring and that River-324 tries to
> help manage to see if we can get our heads around it and hopefully make
> sure that we believe that River-324's problem of stale or wrong cache
> contents can be managed more readily.
>
> Gregg Wonderly
>
>
> * I had a hard time getting SDM to work reliably for me, and in
> particular, it initially did not perform unicast lookups, but waited for
> multi-cast announcements only and this made it non-functional on
> non-local network segments.
>
> Patricia Shanahan wrote:
>> addProxyReg is called from the LookupCacheImpl constructor and from
>> the discover method of a DiscoveryListener attached to a
>> LookupDiscoveryManager. I don't see any specific rules about thread
>> choice for the discover call, but as far as I can tell the lookup
>> discovery manager does not know anything about the service discovery
>> manager's cache, and so is unlikely to use its TaskManager.
>>
>> A LookupTask is created in the run method of a RegisterListenerTask,
>> which is the class of task that addProxyReg adds to the cacheTaskMgr
>> TaskManager. RegisterListenerTask inherits the CacheTask runAfter
>> method that just returns false, so there is nothing to stop it from
>> running in parallel with any older task.
>>
>> Looks like two or more different threads to me. I think a simpler
>> explanation for the lack of problem reports lies in two effects:
>>
>> 1. Almost all the time, addProxyReg will win the race. It has a
>> significant running start. Immediately after the end of its
>> synchronized block, it drops straight into calling the TaskManager add
>> method. Meanwhile, another thread that was at the start of its
>> synchronized block would have to go through creating a new Task object
>> before attempting the add call. For addProxyReg to lose the race for
>> the add method's TaskManager synchronization would require a cache
>> miss or an interrupt in a window a few instructions long.
>>
>> 2. The symptom, if any, would be cache confusion that would be very
>> hard to distinguish from a variation on
>> https://issues.apache.org/jira/browse/RIVER-324. If there were any
>> observations of this problem before RIVER-324 was fixed, they would
>> have been conflated with it and presumed fixed.
>>
>> I am a strong believer in closing even the tiniest timing windows,
>> because they can collectively lead to general flakiness even if there
>> are no reproducible bug reports.
>>
>> Patricia
>>
>>
>>
>>
>>
>> On 8/5/2010 6:13 AM, Gregg Wonderly wrote:
>>> I haven't looked yet, but a quick thought I had was, are the other
>>> dependent instances (higher sequence numbers) actually created on a
>>> separate thread, so that ordering could be compromised, or are they just
>>> created in later in the same threads execution path?
>>>
>>> Gregg Wonderly
>>>
>>> Peter Firmstone wrote:
>>>> Thanks Patricia, sharp eyes!
>>>>
>>>> Cheers,
>>>>
>>>> Peter.
>>>>
>>>> Patricia Shanahan wrote:
>> ...
>>>>> ServiceDiscoveryManager.LookupCacheImpl.taskSeqN is used to
>>>>> initialize sequence number fields in some CacheTask subclasses that
>>>>> are then used in runAfter methods.
>>>>>
>>>>> ServiceDiscoveryManager.LookupCacheImpl.addProxyReg creates a task
>>>>> with incremented taskSeqN inside a serviceIdMap synchronized block,
>>>>> but adds it to the cacheTaskMgr outside the block.
>>>>>
>>>>> public void addProxyReg(ProxyReg reg) {
>>>>> RegisterListenerTask treg;
>>>>> synchronized(serviceIdMap) {
>>>>> treg = new RegisterListenerTask(reg, taskSeqN++);
>>>>> }//end sync(serviceIdMap)
>>>>> cacheTaskMgr.add(treg);
>>>>> }//end LookupCacheImpl.addProxyReg
>>>>>
>>>>> In the remaining sequence numbered CacheTask subclass cases, the
>>>>> cacheTaskMgr.add call is done inside the same
>>>>> synchronized(serviceIdMap) block as the taskSeqN increment.
>>>>>
>>>>> There is a window in addProxyReg during which a LookupTask or
>>>>> NotifyEventTask with a higher sequence number could be added before
>>>>> the RegisterListenerTask task addProxyReg is adding. The LookupTask
>>>>> and NotifyEventTask runAfter methods both wait for any
>>>>> RegisterEventTask with a lower sequence number.
>>>>>
>>>>> I believe this is a bug, but it will be hard to construct a test case
>>>>> because it involves such a narrow timing window. I do recommend
>>>>> moving the cacheTaskMgr.add(treg) statement inside the synchronized
>>>>> block.
>>
>
>


Mime
View raw message