harmony-dev mailing list archives

From "Xiao-Feng Li" <xiaofeng...@gmail.com>
Subject Re: [drlvm] run of smoke tests on overloaded box
Date Sat, 16 Jun 2007 06:10:34 GMT
On 6/16/07, Rana Dasgupta <rdasgupt@gmail.com> wrote:
> I could finally repro a couple of these cases (on Linux x64), and I
> think this problem is happening because of the known weakness in the
> shutdown of daemon threads. gdb shows a SIGSEGV in the cancel
> handler, usually reporting a zombie thread.
> In the current shutdown we register safepoint shutdown callbacks and
> do timed joins, waiting for the daemon threads to exit. We make a
> reasonable guess at the join timeout interval. After this, we kill the
> threads (on Linux with a pthread_cancel). When the cycle eater runs
> in the background, the join interval we have chosen is not enough. But
> sometimes, between the time we give up on the joins and the time we
> post the cancel signals, the thread (whose default attribute is
> joinable, not detached) finally completes the safepoint shutdown
> callback and exits. It is now a zombie, and would release all its
> resources on join. But shutdown has already given up on joining and
> has started pthread_cancel(). The cancel signal cannot be handled by
> the zombie thread and raises SIGSEGV. I don't know Linux well enough
> to know the exact dynamics of zombies.

Very interesting study. This situation arises not only here but also
in finalizer thread shutdown. We have a test case that creates an
infinite loop in a finalizer (or waits on a lost socket) and requires
that the system shut down correctly by somehow detecting this
situation and not waiting for the (dead) finalizer to finish. At the
same time, we have a test case that lets the finalizer run lots of
heavy-duty work, requiring the system to detect this situation and
wait for the finalizer to finish.

In GCv5, we solved the problem (or passed the tests, anyway) by having
the system do a timed wait on the finalizers. If at the timeout event
we detect that at least one more finalizer has executed, we loop back
and do the timed wait again, since in that case the finalizers are
still making progress. If at a timeout event we find the number of
executed finalizers unchanged, we decide the finalizers are dead and
go on to exit.

The problem is that we don't know which timeout value is reasonable,
1 ms or 1 s. In this case, I personally think a bigger value makes more
sense. In our case the timed wait doesn't need to run to the timeout:
it can also be woken up by the finalizers once they finish, so a longer
timeout value normally does not impact performance. I guess the same
applies to the thread-joining timed wait.


> I multiplied the join timeout interval by a factor of 100 and the
> errors went away with the cycle eater running in the background. I
> don't think we want to make changes like this in the VM; it is not a
> good way to tune wall-clock times (some of which need to exist in the
> implementation).
> I also have some concerns about how we are choosing to create these
> test scenarios. Artificially severe stress conditions can be simulated
> in tests, creating failures that are time-consuming to debug, but I
> don't know how much extra information they give us. For example, we
> already know that daemon thread shutdown is not perfect. If we choose
> to create stress, I think it is better to use real applications or
> well-known workloads. In that case, failures would be more meaningful
> and would give us good guidance on tuning things.
> On 6/6/07, Vladimir Ivanov <ivavladimir@gmail.com> wrote:
> > issue HARMONY-4080 was created to track it.
> >
> >  thanks, Vladimir
> >
> > On 5/18/07, Vladimir Ivanov <ivavladimir@gmail.com> wrote:
> > > The CC/CI report failures just now on linux x86_64 in default mode:
> > > -----------------------------
> > > Running test : thread.ThreadInterrupt
> > > *** FAILED **** : thread.ThreadInterrupt (139 res code)
> > > -----------------------------
> > >
> > >  thanks, Vladimir
> > >
> > >
> > > On 5/18/07, Rana Dasgupta <rdasgupt@gmail.com> wrote:
> > > > OK, I will also try to change this test to make it more
> > > > meaningful than it is now. We can then decide whether we want to
> > > > keep it or lose it.
> > > >
> > > > On 5/17/07, Pavel Rebriy <pavel.rebriy@gmail.com> wrote:
> > > > > Maybe it would be better to modify the tests in the correct way?
> > > > > The test gc.ThreadSuspension checks the suspension model during
> > > > > garbage collection. It is a very useful test for the VM.
> > > > > --
> > > > > Best regards,
> > > > > Pavel Rebriy
> > > > >
> > > >
> > >
> >

