harmony-dev mailing list archives

From "Xiao-Feng Li" <xiaofeng...@gmail.com>
Subject Re: [drlvm][threading] a description of thread suspension race conditions plus initial ideas on how to fix
Date Mon, 20 Aug 2007 02:52:23 GMT
On 8/19/07, Weldon Washburn <weldonwjw@gmail.com> wrote:
> All,
>
> Four things.
>
> 1)
> I put a description of thread suspension race conditions in
> harmony-4644, HarmonyJVM_Thread_State_Diagrams.ppt.  If the analysis
> is correct, there is a race condition where the suspend request thread
> can see a stale value of the target thread's disable_count.  In other
> words, the suspend request thread can set the target thread->request
> field to true then (accidentally) read a stale value of
> thread->disable_count.  The stale value misleads the requestor thread
> to think it is OK to enumerate the target's stack.
>
> 2)
> Below is rough code that shows an algorithm that might fix the above
> problem.  The concept is to attempt to repair the existing code base
> rather than a complete redesign.  Basically the idea is that the
> target thread will set disable_count = 1 *before* it checks
> thread->request.  And the suspension requestor thread needs to wait
> for the HW store buffer to drain.  I do not know of any HW
> architectural guarantees on the maximum clock cycles to drain the
> store buffer.  I made an initial guess of 300 msec.  It may be that
> 30msec is sufficient for many generations of future HW.  I simply do
> not know at this point.  Please see the below code for more details.
> (Even more details are in harmony-4644,  AAA_thread_suspension.cpp.)

I guess it's not a good idea to sleep explicitly for 300ms. A minor
collection can be as fast as tens of ms.

> 3)
> A question for GC folks.  What is the maximum sleep() time in the
> below code that is acceptable?  Note that for GC suspending a single
> thread, we always need to wait for the HW store buffer to drain.
> However for stop-the-world GC, we do not necessarily have to wait for
> each target thread's store buffer to drain. We can set thread->request
> on all threads, then do a single sleep(xx msec) to ensure the store
> buffers have drained.

To force the store buffer to drain, I think a safer (and more elegant)
way is to send a signal to the target thread (e.g. with pthread_kill),
so that the target thread is interrupted.

I am afraid this kind of explicit timing control, such as a 300 msec
sleep, should always be avoided.
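
Just to illustrate what I mean, below is a rough sketch of such a signal
handshake. It is not Harmony code: SIGUSR2 and the suspend_ack field are
made-up names, and get_TLS_thread() is borrowed from your pseudocode. The
requestor interrupts the target and then waits for an explicit
acknowledgement instead of sleeping for a guessed store-buffer drain time:

#include <pthread.h>
#include <signal.h>
#include <sched.h>
#include <string.h>
#include <stdatomic.h>

typedef struct hythread {
    pthread_t  os_thread;
    atomic_int suspend_ack;     /* hypothetical new field */
    int        disable_count;
    /* ... the rest of the existing fields ... */
} *hythread_t;

extern hythread_t get_TLS_thread(void);

/* Runs on the target thread when the requestor interrupts it. */
static void suspend_handshake_handler(int sig)
{
    hythread_t self = get_TLS_thread();
    (void)sig;
    /* The release store publishes everything the target wrote before it
       was interrupted (including disable_count) to the requestor. */
    atomic_store_explicit(&self->suspend_ack, 1, memory_order_release);
}

void install_handshake_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = suspend_handshake_handler;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR2, &sa, NULL);
}

/* Requestor side: instead of sleep(300msec), interrupt the target and
   wait for its acknowledgement. */
void drain_target_stores(hythread_t thread)
{
    atomic_store_explicit(&thread->suspend_ack, 0, memory_order_relaxed);
    pthread_kill(thread->os_thread, SIGUSR2);
    while (atomic_load_explicit(&thread->suspend_ack, memory_order_acquire) == 0)
        sched_yield();
    /* From here on, a read of thread->disable_count is at least as new as
       the value the target had when the signal was delivered. */
}

The nice property is that the requestor waits exactly as long as the
target needs, whether that is microseconds or much longer, instead of
guessing at a drain time.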

> 4)
> Next steps -- clean up the below suspension kernel and run experiments
> on 4-way SMP to look for race conditions.  I hope to do this today.
>
>
>
> ///////////////////////////////////////////////
>
> void mutator__hythread_suspend_disable()
> {
>     hythread_t self = get_TLS_thread();
>
>     // the following write will slowly dribble out to the coherency domain
>     // we could add an mfence but this would slow down
>     // hythread_suspend_disable() execution (bad idea)
>     // instead of an mfence() in the target thread, we sleep 300msec
>     // in the requestor thread to allow the mutator's HW store buffer to empty
>     self->disable_count++;
>
>     if (self->disable_count == 1) {
>         if (readLocation(&self->request) ) {
>
>             self->disable_count = 0;
>             self->now_blocked_at_disable_count_0_to_1 = true;
>             mfence();
>
>             while (self->j_l_t_suspend_request || self->gc_suspend_request) {
>                 // put a timeout to ensure a missed semaphore set does not
>                 // hang the system (graceful degradation)
>                 status = hysem_wait(self->resume_event, timeout_3000msec);
>                 if (status == timeout && self->gc_suspend_request) {
>                     // log a warning/diagnostic somewhere
>                 }
>             }
>
>             self->disable_count = 1;
>             self->now_blocked_at_disable_count_0_to_1 = false;
>             mfence();
>         }
>         else {
>             // intentionally do nothing
>             // eventually disable_count going from 0 to 1 will make it
>             // into the coherency domain
>         }
>     }
> }
>
> void hythread_suspend_other(hythread_t thread, suspend_t kind)
> {
>     if (kind == J_L_T_SUSPEND) {
>         if (thread->j_l_t_suspend_request == true) return;  // it's already been dealt with
>         thread->j_l_t_suspend_request = true;
>     } else {
>         assert(kind == GCSUSPEND);
>         // only allow one GC at a time for now
>         assert(thread->gc_suspend_request == false);
>         thread->gc_suspend_request = true;
>     }
>     apr_atomic_inc32(&thread->request);
>
>     // if the target thread just did a "disable_count = 1", the below sleep
>     // will allow enough time so that the target thread's HW store buffer
>     // will empty into the coherency domain
>     sleep(300msec);
>
>     // we now know we are not fetching a stale version of disable_count
>     // from the coherency domain
>     while (readLocation(&thread->disable_count) != 0) {
>         yield();
>     }
>     if ( (thread->now_blocked_at_back_branch == 0) &&
>         (thread->now_blocked_at_disable_count_0_to_1 == 0) ) {
>             assert(0);
>     }
> }
>
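
For completeness: the race in point 1 is the classic store-load ordering
(Dekker-style) problem. Below is a minimal, stand-alone illustration with
C11 atomics (the variable names are simplified, not the real hythread
fields) of what the ordering requirement looks like when both sides pay
for a fence, which the pseudocode above deliberately avoids on the target
side for performance:

#include <stdatomic.h>

atomic_int request;        /* written by the requestor */
atomic_int disable_count;  /* written by the target    */

/* Target thread (the suspend_disable path). */
void target_side(void)
{
    atomic_store_explicit(&disable_count, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* order the store before the load */
    if (atomic_load_explicit(&request, memory_order_relaxed)) {
        /* back off to a GC-safe point, as in the pseudocode above */
    }
}

/* Requestor thread (the suspend_other path). */
int safe_to_enumerate(void)
{
    atomic_store_explicit(&request, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* order the store before the load */
    /* With both fences in place, at least one of the two threads is
       guaranteed to observe the other's store, so the requestor cannot
       read disable_count == 0 while the target has already entered the
       suspend-disabled region without seeing the request. */
    return atomic_load_explicit(&disable_count, memory_order_relaxed) == 0;
}

Without one of the fences (or a sleep long enough to drain the store
buffer, as in the sketch above), both loads can return the stale 0 and
the requestor may enumerate a stack that is still being mutated.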


-- 
http://xiao-feng.blogspot.com
