river-dev mailing list archives

From Peter <j...@zeus.net.au>
Subject Re: ServiceDiscoveryManager Task Race
Date Fri, 17 Jan 2014 23:46:23 GMT
So, after spending some time thinking and going over notes...

SDM's EventReg keeps a tally of the latest event ID, and SDM attempts to execute tasks in order
for each ServiceID, so EventReg seems the best place to store two small PriorityQueues: one for
pending tasks, the other for executing tasks.

Callbacks can be used to remove tasks from the executing queue after completion. In addition,
allowing a currently executing task to submit another task that takes precedence over pending
tasks would address some of the race conditions.
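
A rough sketch of the idea (the SeqTask interface, the sequence-number
ordering, and the EventRegTasks holder below are illustrative guesses at
a shape this could take, not the actual SDM internals):

    import java.util.Comparator;
    import java.util.PriorityQueue;
    import java.util.concurrent.Executor;

    /**
     * Sketch only: per-EventReg task bookkeeping. Tasks are ordered by
     * the event sequence number they belong to, so they run in event
     * order even when the backing Executor has many idle threads.
     */
    class EventRegTasks {
        /** Illustrative task type carrying its event sequence number. */
        interface SeqTask extends Runnable {
            long seqNo();
        }

        private static final Comparator<SeqTask> BY_SEQ =
                Comparator.comparingLong(SeqTask::seqNo);

        private final PriorityQueue<SeqTask> pending = new PriorityQueue<>(BY_SEQ);
        private final PriorityQueue<SeqTask> executing = new PriorityQueue<>(BY_SEQ);
        private final Executor executor;

        EventRegTasks(Executor executor) {
            this.executor = executor;
        }

        /** Queue a task; it runs once no earlier task is still executing. */
        synchronized void submit(SeqTask task) {
            pending.add(task);
            drain();
        }

        /** Completion callback: drop the task from the executing queue,
         *  then see whether the head of the pending queue can now run. */
        private synchronized void complete(SeqTask task) {
            executing.remove(task);
            drain();
        }

        /* Called with the lock held. Hands the head of the pending queue
         * to the executor only if nothing still executing precedes it. */
        private void drain() {
            SeqTask head = pending.peek();
            if (head == null) {
                return;
            }
            SeqTask running = executing.peek();
            if (running != null && BY_SEQ.compare(running, head) < 0) {
                return; // an earlier task is still in flight
            }
            pending.poll();
            executing.add(head);
            executor.execute(() -> {
                try {
                    head.run();
                } finally {
                    complete(head);
                }
            });
        }
    }

In this sketch, a task submitted by a currently executing task enters the
same pending queue and, with an equal or lower sequence number, moves to
its head, which gives it the precedence described above.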

Regards,

Peter.


----- Original message -----
> ----- Original message -----
> > I don't think it is just a matter of latency. When I last looked
> > closely at TaskManager, it appeared to me that it kept each task in
> > the runAfter checking list until the task returned from its run
> > method. That way, a dependent task was prevented from running until
> > completion of the task on which it depends.
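
For reference, a task expresses that dependency under TaskManager roughly
like this (a sketch against the com.sun.jini.thread.TaskManager.Task
contract as I recall it; the ServiceTask class and its serviceID field
are made up for illustration):

    import java.util.List;

    import com.sun.jini.thread.TaskManager;

    /** Sketch: a task that must not run before any earlier task queued
     *  for the same service. */
    class ServiceTask implements TaskManager.Task {
        final Object serviceID; // the identity the ordering is keyed on

        ServiceTask(Object serviceID) {
            this.serviceID = serviceID;
        }

        public void run() {
            // the actual work goes here
        }

        /** TaskManager calls this with its current task list; returning
         *  true defers this task until the conflicting task has left the
         *  list, i.e. until its run() method has returned. */
        public boolean runAfter(List tasks, int size) {
            for (int i = 0; i < size; i++) {
                Object t = tasks.get(i);
                if (t instanceof ServiceTask
                        && ((ServiceTask) t).serviceID.equals(serviceID)) {
                    return true;
                }
            }
            return false;
        }
    }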
>
> Yes, you're right, it does. It was only in the case of RetryTask that
> a task was removed after each unsuccessful attempt.
>
>
> >
> > I can see two approaches to this. One is to make the TaskManager
> > replacement do something similar, wrapping the client tasks in its
> > own code to keep track of them for dependency purposes until they
> > finish. The other is to dump the whole dependency issue on the client
> > code, which can use anything in the entire range of concurrency
> > control mechanisms, including atomicity, synchronization, semaphores,
> > etc.
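
A minimal sketch of the first approach, keeping each submitted task on a
"live" list until its run() returns so the dependency check can see
in-flight work (TrackingExecutor, DepTask and mustRunAfter are
hypothetical names, not a River API):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.concurrent.ExecutorService;

    /** Sketch: a TaskManager-style wrapper over an ExecutorService that
     *  keeps each task visible to dependency checks until run() returns. */
    class TrackingExecutor {
        /** Hypothetical dependency test, analogous to Task.runAfter. */
        interface DepTask extends Runnable {
            boolean mustRunAfter(List<DepTask> live);
        }

        private final List<DepTask> live = new ArrayList<>();    // running now
        private final List<DepTask> waiting = new ArrayList<>(); // deferred

        private final ExecutorService pool;

        TrackingExecutor(ExecutorService pool) {
            this.pool = pool;
        }

        synchronized void submit(DepTask task) {
            waiting.add(task);
            schedule();
        }

        /* Called with the lock held: start every waiting task whose
         * dependencies are no longer live. */
        private void schedule() {
            for (Iterator<DepTask> it = waiting.iterator(); it.hasNext(); ) {
                DepTask t = it.next();
                if (!t.mustRunAfter(live)) {
                    it.remove();
                    live.add(t); // stays here until run() returns
                    pool.execute(() -> {
                        try {
                            t.run();
                        } finally {
                            finished(t);
                        }
                    });
                }
            }
        }

        private synchronized void finished(DepTask t) {
            live.remove(t);
            schedule(); // a dependent task may now be runnable
        }
    }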
>
> Comments made on the list so far indicate people want configurable
> Executors, which makes sense: it allows deployment-time optimisation,
> an area we don't get much feedback or information about.
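
For what it's worth, the existing net.jini.config machinery already
supports this kind of thing; a deployer could supply the executor as a
configuration entry (the component and entry names below are assumptions
for illustration, not SDM's actual entries):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import net.jini.config.Configuration;
    import net.jini.config.ConfigurationException;
    import net.jini.config.ConfigurationProvider;

    class ExecutorConfigSketch {
        static ExecutorService getExecutor(String[] configOptions)
                throws ConfigurationException {
            Configuration config =
                    ConfigurationProvider.getInstance(configOptions);
            // Component and entry names are illustrative only.
            return (ExecutorService) config.getEntry(
                    "net.jini.lookup.ServiceDiscoveryManager",
                    "executorService",
                    ExecutorService.class,
                    Executors.newSingleThreadExecutor()); // safe default
        }
    }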
>
> Having said that, there's a lot of mutable state in SDM's
> implementation, and that's difficult to manage.
>
> Cheers,
>
> Peter.
>
> >
> > Patricia
> >
> > On 1/14/2014 3:32 AM, Peter Firmstone wrote:
> > > With TaskManager.Task.runAfter, throughput wasn't significant enough
> > > for this race to occur.
> > >
> > > If I make the ExecutorService single threaded, the error doesn't
> > > occur, as the tasks are executed in correct dependency order.
> > > However, when the ExecutorService has a lot of threads ready, the
> > > tasks can't be arranged in a queue, as they're immediately
> > > handed off to a waiting thread.
> > >
> > > TaskManager added sufficient latency to prevent this race from
> > > occurring.
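
The scheduling difference is easy to reproduce in isolation (a toy
demonstration, not SDM code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class OrderingDemo {
        public static void main(String[] args) {
            // Single thread: tasks start and finish in submission order.
            ExecutorService ordered = Executors.newSingleThreadExecutor();
            // Cached pool: each task is handed straight to a waiting
            // thread, so execution order is unpredictable.
            ExecutorService racy = Executors.newCachedThreadPool();

            for (int i = 0; i < 5; i++) {
                final int n = i;
                ordered.execute(() -> System.out.println("ordered " + n));
                racy.execute(() -> System.out.println("racy " + n));
            }
            ordered.shutdown();
            racy.shutdown();
        }
    }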
> > >
> > > I think the solution to this race condition (note that all accesses
> > > are synchronized) is to make the operations atomic and to separate
> > > out the event notifications.
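
One way to read "make the operations atomic and separate out the event
notifications", sketched with a ConcurrentHashMap standing in for
serviceIdMap (the CacheSketch class and the fire* methods are invented
for illustration):

    import java.util.Objects;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    /** Sketch: decide added/changed/removed in one atomic map operation,
     *  then notify listeners outside that atomic step. */
    class CacheSketch<K, V> {
        private final ConcurrentMap<K, V> serviceIdMap =
                new ConcurrentHashMap<>();

        void onEvent(K id, V newItem) {
            if (newItem == null) { // the event says the service is gone
                V old = serviceIdMap.remove(id); // atomic check-and-remove
                if (old != null) {
                    fireServiceRemoved(id, old);
                }
                return;
            }
            // One atomic put replaces the old "synchronized read, then
            // separate write" sequence that lets another task interleave.
            V old = serviceIdMap.put(id, newItem);
            if (old == null) {
                fireServiceAdded(id, newItem);
            } else if (!Objects.equals(old, newItem)) {
                fireServiceChanged(id, old, newItem);
            }
        }

        // Notifications are delivered after the map update has committed,
        // so listeners never observe a half-finished update.
        private void fireServiceAdded(K id, V item) { /* notify listeners */ }
        private void fireServiceChanged(K id, V o, V n) { /* ... */ }
        private void fireServiceRemoved(K id, V item) { /* ... */ }
    }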
> > >
> > > Thoughts?
> > >
> > > Peter.
> > >
> > > com/sun/jini/test/impl/servicediscovery/event/LookupTaskServiceIdMapRace.td
> > >
> > > /**
> > > * This test attempts to simulate the following race condition that
> > > * can occur between an instance of UnmapProxyTask (created and queued
> > > * in LookupTask) and instances of NewOldServiceTask that are created
> > > * and queued by NotifyEventTask:
> > > *
> > > * - 1 LUS {L0}
> > > * - N (~250) services {s0, s1, ..., sN-1}, to be registered in L0
> > > * - M (~24) SDMs, each with 1 cache with template matching all si's
> > > *      {SDM_0/C0, SDM_1/C1, ... SDM_M-1/CM-1}
> > > *
> > > * Through the sheer number of service registrations, caches, and
> > > * events, this test attempts to produce the conditions that cause
> > > * the regular occurrence of the race between an instance of
> > > * UnmapProxyTask and instances of NewOldServiceTask produced by
> > > * NotifyEventTask when a service event is received from L0.
> > > *
> > > * This test starts lookup L0 during construct. Then, when the test
> > > * begins running, half the services are registered with L0, followed
> > > * by the creation of half the SDMs and corresponding caches; which
> > > * causes the tasks being tested to be queued, and event generation
> > > * to ultimately begin. After registering the first half of the
> > > * services and creating the first half of the SDMs, the remaining
> > > * services are registered and the remaining SDMs and caches are
> > > * created. As events are generated, the number of serviceAdded and
> > > * serviceRemoved events are tallied.
> > > *
> > > * When an SDM_i/cache_i pair is created, an instance of
> > > * RegisterListenerTask is queued and executed. RegisterListenerTask
> > > * registers a remote event listener with L0's event mechanism. When
> > > * the services are registered with L0, that listener receives
> > > * service events, which causes NotifyEventTask to be queued and
> > > * executed. After RegisterListenerTask registers for events with L0,
> > > * but before RegisterListenerTask exits, an instance of LookupTask
> > > * is queued and executed. LookupTask retrieves from L0 a "snapshot"
> > > * of its state. Thus, while events begin to arrive informing each
> > > * cache of the services that are registering with L0, LookupTask is
> > > * querying L0 for its current state.
> > > *
> > > * Upon receipt of a service event, NotifyEventTask queues a
> > > * NewOldServiceTask to determine if the service corresponding to the
> > > * event represents a new service that has been added, a change to a
> > > * previously-registered service, or the removal of a service from
> > > * L0. If the event corresponds to a newly registered service, the
> > > * service is added to the cache's serviceIdMap and a serviceAdded
> > > * event is sent to any listeners registered with the cache.
> > > * That is,
> > > *
> > > * Service event received
> > > *
> > > *      NotifyEventTask {
> > > *          if (service removed) {
> > > *              remove service from serviceIdMap
> > > *              send serviceRemoved
> > > *          } else {
> > > *              NewOldServiceTask
> > > *                  if (service changed) {
> > > *                      send serviceChanged
> > > *                  } else if (service is new) {
> > > *                      add service to serviceIdMap
> > > *                      send serviceAdded
> > > *                  }
> > > *          }
> > > *      }
> > > *
> > > * While events are being received and processed by NotifyEventTask
> > > * and NewOldServiceTask, LookupTask is asynchronously requesting a
> > > * snapshot of L0's state and attempting to process that snapshot to
> > > * populate the same serviceIdMap that is being populated by
> > > * instances of NewOldServiceTask that are initiated by
> > > * NotifyEventTask. LookupTask first examines serviceIdMap, looking
> > > * for services that are NOT in the snapshot; that is, services that
> > > * are not currently registered with L0. Such a service is referred
> > > * to as an "orphan". For each orphan service that LookupTask finds,
> > > * an instance of UnmapProxyTask is queued. That task removes the
> > > * service from the serviceIdMap and sends a serviceRemoved event to
> > > * any listeners registered with the cache. After processing any
> > > * orphans that it finds, LookupTask then queues an instance of
> > > * NewOldServiceTask for each service in the snapshot previously
> > > * retrieved. That is,
> > > *
> > > * LookupTask - retrieve snapshot {
> > > *
> > > *      for each service in serviceIdMap {
> > > *          if (service is not in snapshot) { //orphan
> > > *              UnmapProxyTask {
> > > *                  remove service from serviceIdMap
> > > *                  send serviceRemoved
> > > *              }
> > > *          }
> > > *      }
> > > *      for each service in snapshot {
> > > *              NewOldServiceTask
> > > *                  if (service changed) {
> > > *                      send serviceChanged
> > > *                  } else if (service is new) {
> > > *                      add service to serviceIdMap
> > > *                      send serviceAdded
> > > *                  }
> > > *      }
> > > * }
> > > *
> > > * The race can occur because the NewOldServiceTasks that are queued
> > > * by the NotifyEventTasks can add services to the serviceIdMap
> > > * between the time LookupTask retrieves the snapshot and the time it
> > > * analyzes the serviceIdMap for orphans. That is,
> > > *
> > > *                                      o SDM_i/cache_i created
> > > * RegisterListenerTask
> > > * --------------------
> > > *   register for events
> > > *   LookupTask
> > > *   ----------
> > > *     retrieve snapshot {s0,s1,s2}
> > > *                                      o s3 registered with L0
> > > *                                      o L0 sends NO_MATCH_MATCH
> > > *                                      NotifyEventTask
> > > *                                      ---------------
> > > *                                        NewOldServiceTask
> > > *                                        -----------------
> > > *                                          add s3 to serviceIdMap
> > > *                                          send serviceAdded event
> > > *     ORPHAN: s3 in serviceIdMap, not snapshot
> > > *     UnmapProxyTask
> > > *     --------------
> > > *       remove s3 from serviceIdMap
> > > *       send serviceRemoved event
> > > *
> > > * This test returns a pass when no race is detected between
> > > * UnmapProxyTask and any NewOldServiceTask initiated by a
> > > * NotifyEventTask. This is determined by examining the serviceAdded
> > > * and serviceRemoved event tallies collected during test execution.
> > > * If, for each SDM/cache combination, the number of serviceAdded
> > > * events received equals the number of services registered with L0,
> > > * and no serviceRemoved events are received, then there is no race,
> > > * and the test passes; otherwise, the test fails (in particular, if
> > > * at least one serviceRemoved event is sent by at least one
> > > * SDM/cache).
> > > *
> > > * No special modifications to the SDM are required to cause the
> > > * race condition to occur consistently. When running this test
> > > * individually on Solaris, out of "the vob", under a JERI or JRMP
> > > * configuration, and with 24 SDMs/caches and 250 services, the race
> > > * condition was consistently observed (until a fix was integrated).
> > > * Thus, it appears that the greater the number of
> > > * SDMs/caches/services, the greater the probability the conditions
> > > * for the race will be encountered.
> > > *
> > > * Related bug ids: 6291851
> > > */
> >
>

