harmony-dev mailing list archives

From "Rana Dasgupta" <rdasg...@gmail.com>
Subject Re: [drlvm] run of smoke tests on overloaded box
Date Sat, 16 Jun 2007 02:58:26 GMT
I was finally able to reproduce a couple of these cases (on Linux x64),
and I think the problem is caused by the known weakness in the shutdown
of daemon threads. gdb shows a SIGSEGV in the cancel handler, usually
reporting a zombie thread.

In the current shutdown we register safepoint shutdown callbacks and
do timed joins, waiting for the daemon threads to exit. We make a
reasonable guess at the join timeout interval. After this, we kill the
threads (on Linux, with pthread_cancel). When the cycle eater runs
in the background, the join interval we have chosen is not long enough.
Sometimes, between the time we give up on the joins and the time we
post the cancel signals, a thread (the default attribute is joinable,
not detached) finally completes the safepoint shutdown callback
and exits. It is now a zombie or whatever, and would release all its
resources on join. But the shutdown code has already given up on the
join and has started calling pthread_cancel(). The cancel signal cannot
be handled on the zombie thread, and a SIGSEGV is raised. I don't know
Linux well enough to know the exact dynamics of zombies.

I multiplied the join timeout interval by a factor of 100 and the
errors went away, even with the cycle eater running in the background.
I don't think we want to make changes like this in the VM. This is not
a good way to tune wall clock times (some of which need to exist in the
implementation).

I also have some concerns about how we are choosing to create these
test scenarios. Tests can simulate artificial, severe stress conditions
and produce failures that are time consuming to debug, but I
don't know how much extra information they give us. For example, we
already know that daemon thread shutdown is not perfect. If we choose
to create stress, I think it is better to use real applications
or well-known workloads. In that case, failures would be more
meaningful and would give us some good guidance on tuning things.

On 6/6/07, Vladimir Ivanov <ivavladimir@gmail.com> wrote:
> issue HARMONY-4080 was created to track it.
>
>  thanks, Vladimir
>
> On 5/18/07, Vladimir Ivanov <ivavladimir@gmail.com> wrote:
> > The CC/CI report failures just now on linux x86_64 in default mode:
> > -----------------------------
> > Running test : thread.ThreadInterrupt
> > *** FAILED **** : thread.ThreadInterrupt (139 res code)
> > -----------------------------
> >
> >  thanks, Vladimir
> >
> >
> > On 5/18/07, Rana Dasgupta <rdasgupt@gmail.com> wrote:
> > > OK, I will also try to change this test to make it more meaningful
> > > than it is now. We can then decide if we want to keep it or lose it?
> > >
> > > On 5/17/07, Pavel Rebriy <pavel.rebriy@gmail.com> wrote:
> > > > May be better modify tests to the correct way?
> > > > The test gc.ThreadSuspension check suspension model during garbage
> > > > collection. It is a very useful test for VM.
> > > > --
> > > > Best regards,
> > > > Pavel Rebriy
> > > >
> > >
> >
>
