river-dev mailing list archives

From Patricia Shanahan <p...@acm.org>
Subject Re: ServiceDiscoveryManager Task Race
Date Tue, 14 Jan 2014 16:39:14 GMT
I don't think it is just a matter of latency. When I last looked closely 
at TaskManager, it appeared to me that it kept each task in the runAfter 
checking list until the task returned from its run method. That way, a 
dependent task was prevented from running until completion of the task 
on which it depends.
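
For reference, the contract I'm describing (sketched from memory; see
com.sun.jini.thread.TaskManager for the authoritative version) looks
roughly like this:

    public interface Task extends Runnable {
        // Returns true if this task must wait for one of the first
        // "size" tasks in the pending list. A task stays in that list
        // until its run() method has returned, which is what delays
        // dependents until the task they depend on has completed.
        boolean runAfter(List tasks, int size);
    }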

I can see two approaches to this. One is to make the TaskManager
replacement do something similar, wrapping the client tasks in its own
code to keep track of them for dependency purposes until they finish.
The other is to dump the whole dependency issue on the client code,
which can use anything in the entire range of concurrency control
mechanisms, including atomicity, synchronization, semaphores, etc.
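
A minimal sketch of the first approach (all names are illustrative, not
existing River API): wrap each client task so that it blocks on the
Futures of the tasks it depends on, and therefore cannot start until
their run() methods have returned:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class DependencyExecutor {
        // Unbounded pool, so a wrapper blocked on a dependency cannot
        // starve that dependency of a thread.
        private final ExecutorService exec = Executors.newCachedThreadPool();

        Future<?> submit(final Runnable task, final Future<?>... dependsOn) {
            return exec.submit(new Runnable() {
                public void run() {
                    try {
                        for (Future<?> f : dependsOn) {
                            f.get(); // wait for each dependency to finish
                        }
                    } catch (Exception e) {
                        Thread.currentThread().interrupt();
                        return; // skip the task if a dependency failed
                    }
                    task.run();
                }
            });
        }
    }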

Patricia

On 1/14/2014 3:32 AM, Peter Firmstone wrote:
> With TaskManager.Task.runAfter, throughput wasn't high enough for this
> race to occur.
>
> If I make the ExecutorService single threaded, the error doesn't occur,
> as the tasks are executed in correct dependency order. However, when
> the ExecutorService has a lot of threads ready, the tasks can't be
> arranged in a queue, as they're immediately handed off to waiting
> threads.
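>
> For example (a sketch; the task variables are placeholders), a
> single-threaded executor serialises the tasks, so they can never
> overlap:
>
>     ExecutorService exec = Executors.newSingleThreadExecutor();
>     exec.submit(lookupTask);        // runs to completion first
>     exec.submit(newOldServiceTask); // cannot start until lookupTask returns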
>
> TaskManager added sufficient latency to prevent this race from occurring.
>
> I think the solution to this race condition (note that all accesses are
> synchronized) is to make the operations atomic and separate out the
> event notifications.
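>
> Roughly (a sketch only; serviceIdMap, changed() and the notify methods
> stand in for the cache internals), a ConcurrentMap makes the
> check-and-act a single atomic step, with the notification issued
> afterwards from the result:
>
>     ConcurrentMap<ServiceID,ServiceItem> serviceIdMap =
>         new ConcurrentHashMap<ServiceID,ServiceItem>();
>
>     ServiceItem old = serviceIdMap.putIfAbsent(srvcID, item);
>     if (old == null) {
>         notifyServiceAdded(item);        // event decoupled from map update
>     } else if (changed(old, item)) {
>         serviceIdMap.replace(srvcID, old, item);
>         notifyServiceChanged(old, item);
>     }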
>
> Thoughts?
>
> Peter.
>
> com/sun/jini/test/impl/servicediscovery/event/LookupTaskServiceIdMapRace.td
>
> /**
>   * This test attempts to simulate the following race condition that
>   * can occur between an instance of UnmapProxyTask (created and queued
>   * in LookupTask) and instances of NewOldServiceTask that are created
>   * and queued by NotifyEventTask:
>   *
>   * - 1 LUS {L0}
>   * - N (~250) services {s0, s1, ..., sN-1}, to be registered in L0
>   * - M (~24) SDMs, each with 1 cache with template matching all si's
>   *   {SDM_0/C0, SDM_1/C1, ... SDM_M-1/CM-1}
>   *
>   * Through the sheer number of service registrations, caches, and events,
>   * this test attempts to produce the conditions that cause the regular
>   * occurrence of the race between an instance of UnmapProxyTask and
>   * instances of NewOldServiceTask produced by NotifyEventTask when a
>   * service event is received from L0.
>   *
>   * This test starts lookup L0 during construct. Then, when the test begins
>   * running, half the services are registered with L0, followed by the
>   * creation of half the SDMs and corresponding caches, which causes the
>   * tasks being tested to be queued, and event generation to ultimately
>   * begin. After registering the first half of the services and creating
>   * the first half of the SDMs, the remaining services are registered and
>   * the remaining SDMs and caches are created. As events are generated,
>   * the number of serviceAdded and serviceRemoved events are tallied.
>   *
>   * When an SDM_i/cache_i pair is created, an instance of
>   * RegisterListenerTask is queued and executed. RegisterListenerTask
>   * registers a remote event listener with L0's event mechanism. When the
>   * services are registered with L0, that listener receives service
>   * events, which causes NotifyEventTask to be queued and executed. After
>   * RegisterListenerTask registers for events with L0, but before
>   * RegisterListenerTask exits, an instance of LookupTask is queued and
>   * executed. LookupTask retrieves from L0 a "snapshot" of its state.
>   * Thus, while events begin to arrive informing each cache of the
>   * services that are registering with L0, LookupTask is querying L0 for
>   * its current state.
>   *
>   * Upon receipt of a service event, NotifyEventTask queues a
>   * NewOldServiceTask to determine if the service corresponding to the
>   * event represents a new service that has been added, a change to a
>   * previously-registered service, or the removal of a service from L0.
>   * If the event corresponds to a newly registered service, the service
>   * is added to the cache's serviceIdMap and a serviceAdded event is sent
>   * to any listeners registered with the cache. That is,
>   *
>   * Service event received
>   *
>   *   NotifyEventTask {
>   *     if (service removed) {
>   *       remove service from serviceIdMap
>   *       send serviceRemoved
>   *     } else {
>   *       NewOldServiceTask
>   *         if (service changed) {
>   *           send serviceChanged
>   *         } else if (service is new) {
>   *           add service to serviceIdMap
>   *           send serviceAdded
>   *         }
>   *     }
>   *   }
>   *
>   * While events are being received and processed by NotifyEventTask and
>   * NewOldServiceTask, LookupTask is asynchronously requesting a snapshot
>   * of L0's state and attempting to process that snapshot to populate
>   * the same serviceIdMap that is being populated by instances of
>   * NewOldServiceTask that are initiated by NotifyEventTask. LookupTask
>   * first examines serviceIdMap, looking for services that are NOT in the
>   * snapshot; that is, services that are not currently registered with L0.
>   * Such a service is referred to as an "orphan". For each orphan
>   * service that LookupTask finds, an instance of UnmapProxyTask is
>   * queued. That task removes the service from the serviceIdMap and sends
>   * a serviceRemoved event to any listeners registered with the cache.
>   * After processing any orphans that it finds, LookupTask then queues an
>   * instance of NewOldServiceTask for each service in the snapshot
>   * previously retrieved. That is,
>   *
>   * LookupTask - retrieve snapshot {
>   *
>   *   for each service in serviceIdMap {
>   *     if (service is not in snapshot) { //orphan
>   *       UnmapProxyTask {
>   *         remove service from serviceIdMap
>   *         send serviceRemoved
>   *       }
>   *     }
>   *   }
>   *   for each service in snapshot {
>   *       NewOldServiceTask
>   *         if (service changed) {
>   *           send serviceChanged
>   *         } else if (service is new) {
>   *           add service to serviceIdMap
>   *           send serviceAdded
>   *         }
>   *   }
>   * }
>   *
>   * The race can occur because the NewOldServiceTasks that are queued by
>   * the NotifyEventTasks can add services to the serviceIdMap between the
>   * time LookupTask retrieves the snapshot and the time it analyzes the
>   * serviceIdMap for orphans. That is,
>   *
>   *                        o SDM_i/cache_i created
>   * RegisterListenerTask
>   * --------------------
>   *  register for events
>   *  LookupTask
>   *  ----------
>   *   retrieve snapshot {s0,s1,s2}
>   *                        o s3 registered with L0
>   *                        o L0 sends NO_MATCH_MATCH
>   *                                                   NotifyEventTask
>   *                                                   ---------------
>   *                                                     NewOldServiceTask
>   *                                                     -----------------
>   *                                                     add s3 to serviceIdMap
>   *                                                     send serviceAdded event
>   *   ORPHAN: s3 in serviceIdMap, not snapshot
>   *   UnmapProxyTask
>   *   --------------
>   *     remove s3 from serviceIdMap
>   *     send serviceRemoved event
>   *
>   * This test returns a pass when no race is detected between
>   * UnmapProxyTask and any NewOldServiceTask initiated by a
>   * NotifyEventTask. This is determined by examining the serviceAdded
>   * and serviceRemoved event
>   * tallies collected during test execution. If, for each SDM/cache
>   * combination, the number of serviceAdded events received equals the
>   * number of services registered with L0, and no serviceRemoved events
>   * are received, then there is no race, and the test passes; otherwise,
>   * the test fails (in particular, if at least one serviceRemoved event
>   * is sent by at least one SDM/cache).
>   *
>   * No special modifications to the SDM are required to cause the race
>   * condition to occur consistently. When running this test individually
>   * on Solaris, out of "the vob", under a JERI or JRMP configuration, and
>   * with 24 SDMs/caches and 250 services, the race condition was
>   * consistently observed (until a fix was integrated). Thus, it appears
>   * that the greater the number of SDMs/caches/services, the greater the
>   * probability the conditions for the race will be encountered.
>   *
>   * Related bug ids: 6291851
>   */

