harmony-dev mailing list archives

From "Evgueni Brevnov" <evgueni.brev...@gmail.com>
Subject Re: [drlvm] run of smoke tests on overloaded box
Date Mon, 18 Jun 2007 07:03:33 GMT
On 6/16/07, Xiao-Feng Li <xiaofeng.li@gmail.com> wrote:
> On 6/16/07, Rana Dasgupta <rdasgupt@gmail.com> wrote:
> > I could finally repro a couple of these cases (on Linux x64), and I
> > think this problem is happening because of the known weakness in the
> > shutdown of daemon threads. gdb shows a SIGSEGV in the cancel
> > handler, usually reporting a zombie thread.
> >
> > In the current shutdown we register safepoint shutdown callbacks and
> > do timed joins, waiting for the daemon threads to exit. We make a
> > reasonable guess at the join timeout interval. After this, we kill
> > the threads (on Linux with pthread_cancel). When the cycle eater
> > runs in the background, the join interval we have chosen is not
> > enough. But sometimes, between the time we give up on the joins and
> > the time we post the cancel signals, the thread (the default
> > attribute is joinable, not detached) finally completes the safepoint
> > shutdown callback and exits. It is now a zombie, and would release
> > all of its resources on join. But in shutdown we have given up on
> > the join and have started pthread_cancel(). The cancel signal cannot
> > be handled by the zombie thread and raises SIGSEGV. I don't know
> > Linux well enough to know the exact dynamics of zombies.
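
If I follow that sequence correctly, it is roughly the sketch below. The
join interval and the function name are invented for illustration; this
is not the actual DRLVM code, just the shape of the race as I read it.

#define _GNU_SOURCE
#include <pthread.h>
#include <errno.h>
#include <time.h>

/* Hypothetical per-thread shutdown step: timed join, then cancel. */
static void shutdown_daemon_thread(pthread_t tid)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;   /* the "reasonable guess" join interval */

    /* Wait for the thread to finish its safepoint shutdown callback. */
    if (pthread_timedjoin_np(tid, NULL, &deadline) == ETIMEDOUT) {
        /* Race window: the thread can still complete and exit right
         * here, after we gave up on the join but before the cancel is
         * posted. It is then a zombie (joinable, not detached), and
         * cancelling it is what reportedly crashes under load. */
        pthread_cancel(tid);
        pthread_join(tid, NULL);   /* reap the thread either way */
    }
}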
>
> Very interesting study. This situation happens not only here but also
> in finalizer thread shutdown. We have a test case that creates an
> infinite loop in a finalizer (or waits on a lost socket), requiring
> that the system shut down correctly by figuring out this situation
> and not waiting for the (dead) finalizer to finish. At the same time,
> we have a test case that lets the finalizer run lots of heavy-duty
> work, requiring the system to figure out this situation and wait for
> the finalizer to finish.
>
> In GCv5, we solved the problem (or passed the tests anyway) by letting
> the system do a timed wait on the finalizers. If at the timeout event
> we detect that at least one finalizer has executed, we loop back into
> the timed wait again, since in that case the finalizers are still
> making progress. If at a timeout event we find the number of pending
> finalizers unchanged, we decide the finalizers are dead and go on to
> exit.

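If I understand the GCv5 approach correctly, the loop is roughly the
sketch below. The helper functions and the pending count are
hypothetical, not the actual GCv5 code.

extern int  pending_finalizer_count(void);             /* assumed helper */
extern void timed_wait_on_finalizers(long timeout_ms); /* wakes early when one finishes */

/* Keep waiting as long as the finalizers make progress; give up once a
 * whole timeout passes with the pending count unchanged. */
static void wait_for_finalizers_on_exit(long timeout_ms)
{
    int before = pending_finalizer_count();
    while (before > 0) {
        timed_wait_on_finalizers(timeout_ms);
        int after = pending_finalizer_count();
        if (after == before)
            break;          /* no progress: treat the finalizers as dead */
        before = after;     /* progress: loop back into the timed wait */
    }
}
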
Does the spec require waiting for the finalizer thread to complete? My
understanding is that the finalizer thread should be treated as an
ordinary daemon thread unless runFinalizationOnExit is requested. That
means we should not wait for the finalizer thread to complete and can
shut it down at any time.

Evgueni

>
> The problem is, we don't know which timeout value is reasonable, 1ms
> or 1s. In this case, I personally think a bigger value makes more
> sense. Since in our case the timed wait doesn't need to run to the
> timeout, and can also be woken up by the finalizers once they are
> finished, a longer timeout value does not normally impact performance.
> I guess this is the same case for the thread-joining timed wait?
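
If the timed wait here is a condition-variable wait, then a generous
timeout is indeed cheap in the normal case. A minimal sketch, with
invented names for the shared state:

#include <pthread.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int finished_count = 0;   /* bumped by each finalizer as it finishes */

/* Wait up to timeout_sec, but return as soon as a finalizer signals,
 * so a long timeout only matters when nothing is making progress. */
static void wait_for_finalizer_progress(long timeout_sec)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&lock);
    int seen = finished_count;
    while (finished_count == seen &&
           pthread_cond_timedwait(&done, &lock, &deadline) == 0)
        ;   /* loop guards against spurious wakeups */
    pthread_mutex_unlock(&lock);
}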
>
> Thanks,
> xiaofeng
>
> > I multiplied the join timeout interval by a factor of 100 and the
> > errors went away, with the cycle eater running in the background. I
> > don't think we want to make changes like this in the VM. This is not
> > a good way to tune wall-clock times (some of which need to exist in
> > the implementation).
> >
> > I also have some concern about how we are choosing to create these
> > test scenarios. Artificially severe stress conditions can be
> > simulated in tests, creating failures that are time-consuming to
> > debug. But I don't know how much extra information they give us. For
> > example, we already know that daemon thread shutdown is not perfect.
> > If we choose to create stresses, I think it is better to use real
> > applications or well-known workloads. In that case, failures would
> > be more meaningful and would give us some good guidance on tuning
> > things.
> >
> > On 6/6/07, Vladimir Ivanov <ivavladimir@gmail.com> wrote:
> > > issue HARMONY-4080 was created to track it.
> > >
> > >  thanks, Vladimir
> > >
> > > On 5/18/07, Vladimir Ivanov <ivavladimir@gmail.com> wrote:
> > > > The CC/CI reported failures just now on Linux x86_64 in default mode:
> > > > -----------------------------
> > > > Running test : thread.ThreadInterrupt
> > > > *** FAILED **** : thread.ThreadInterrupt (139 res code)
> > > > -----------------------------
> > > >
> > > >  thanks, Vladimir
> > > >
> > > >
> > > > On 5/18/07, Rana Dasgupta <rdasgupt@gmail.com> wrote:
> > > > > OK, I will also try to change this test to make it more meaningful
> > > > > than it is now. We can then decide if we want to keep it or
> > > > > lose it.
> > > > >
> > > > > On 5/17/07, Pavel Rebriy <pavel.rebriy@gmail.com> wrote:
> > > > > > Maybe it would be better to modify the tests to work correctly?
> > > > > > The test gc.ThreadSuspension checks the suspension model during
> > > > > > garbage collection. It is a very useful test for the VM.
> > > > > > --
> > > > > > Best regards,
> > > > > > Pavel Rebriy
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> http://xiao-feng.blogspot.com
>
