harmony-dev mailing list archives

From "Alexey Varlamov" <alexey.v.varla...@gmail.com>
Subject Re: [drlvm] stress.Mix / MegaSpawn threading bug
Date Fri, 12 Jan 2007 05:51:23 GMT
2007/1/11, Gregory Shimansky <gshimansky@gmail.com>:
> Geir Magnusson Jr. wrote:
> >
> > On Jan 10, 2007, at 9:00 AM, Gregory Shimansky wrote:
> >
> >> Geir Magnusson Jr. wrote:
> >>>> I think the same problem may happen on Linux because it spills out
> >>>> OOMEs on Ubuntu as well.
> >>>>
> >>>> If the test somehow doesn't crash on failed mallocs, it gets to the
> >>>> shutdown stage and hangs with 2 or more deadlocked threads. So far
> >>>> I haven't quite understood how they lock each other.
> >>> Cool - thanks.  If you have a free second, could you note this on the
> >>> wiki page so we don't forget?
> >>
> >> I think it is better to track this with JIRA. AFAIU it is not a
> >> stress-condition issue, so it is a normal bug which should be found
> >> and fixed. I created a new JIRA, HARMONY-2963, which is a subtask of
> >> HARMONY-2803, where Weldon attached his MegaSpawn test.
> >
> > Agreed that a JIRA is important - I just wanted to make sure that we
> > added it somehow to the whiteboard so we had a complete picture of
> > things related to this problem.
>
> Today's investigation showed that the threads hanging at shutdown have 2
> different causes. The 1st one was found by Salikh, and he wrote his
> comments in HARMONY-2963. The bug happened because the counter of
> non-daemon threads was increased before a thread was created. If a thread
> failed to be created for lack of memory, this counter was not decremented
> again.
Good catch!
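
To illustrate the counter bug quoted above, here is a minimal sketch in
plain Java. It is not the DRLVM code: NonDaemonCounterSketch, NativeSpawner
and startCounted are hypothetical names standing in for the thread manager's
bookkeeping. It only shows the idea of rolling the counter back when
creation fails after the counter was already increased.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class NonDaemonCounterSketch {
    // Hypothetical stand-in for the VM's count of live non-daemon threads.
    static final AtomicInteger nonDaemonCount = new AtomicInteger();

    // Hypothetical stand-in for native thread creation; may throw OutOfMemoryError.
    interface NativeSpawner {
        void spawn(Runnable body);
    }

    static void startCounted(NativeSpawner spawner, Runnable body) {
        nonDaemonCount.incrementAndGet(); // counted before the thread exists
        AtomicBoolean counted = new AtomicBoolean(true);
        // Decrements at most once, whichever path gets here first.
        Runnable uncount = () -> {
            if (counted.compareAndSet(true, false)) {
                nonDaemonCount.decrementAndGet();
            }
        };
        boolean created = false;
        try {
            spawner.spawn(() -> {
                try {
                    body.run();
                } finally {
                    uncount.run(); // normal thread exit
                }
            });
            created = true;
        } finally {
            if (!created) {
                uncount.run(); // creation failed: roll the counter back
            }
        }
    }
}
```

With the rollback in place, a failed creation leaves the counter where it
was, so shutdown does not wait for a thread that never existed.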

> Another reason for hanging threads is that they wait in Thread.start().
> When a new thread is started, it has to notify a lock object, in order
> to signal the parent thread that it has been created. This notification
> is sent from java code of the Thread before user code is executed.

IMO the problem is in this design rather than in its implementation. The
specification for j.l.Thread.start() does not require that execution
of run() begin before start() returns. So as I see it, that wait loop
is an unnecessary complication; it is enough to launch a new native
thread and let the OS schedule it.

> But the thread manager also has some native code which is run before the
> Java code of the newly started thread. This native code tries to set up
> some thread state like a new JNI environment and other stuff, and this requires
> allocation of new memory. If allocation of new memory fails, this native
> code of the newly created thread tries to return an error which is not
> seen anywhere (since this is the code which is the first function of the
> new thread), so it is not noticed. But since native code of the new
> thread finishes silently, it never runs the Java code which should do
> monitor notification, so monitor is not notified. So the parent thread
> just waits infinitely.

I suppose a new thread may fail to run Java code for a number of
reasons, e.g. failed JIT compilation, a malfunctioning TI agent, etc.,
not only memory exhaustion. We cannot guarantee hang-free behavior;
this issue is inherent in the design.
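
For illustration, here is a minimal sketch of the start handshake described
above, in plain Java. The class and method names are hypothetical, and
nativeSetup merely stands in for the per-thread native setup. In this
variant the child reports failed setup as well as success, so the parent
wakes up in both cases. It covers only failures that surface as a Throwable
in the child, so it narrows the hang rather than ruling it out.

```java
class StartHandshakeSketch {
    enum State { PENDING, RUNNING, FAILED }

    private final Object lock = new Object();
    private State state = State.PENDING;

    // Records the outcome and wakes the waiting parent.
    private void publish(State s) {
        synchronized (lock) {
            state = s;
            lock.notifyAll();
        }
    }

    // Returns true if the child got as far as user code, false otherwise.
    boolean startAndWait(Runnable nativeSetup, Runnable userCode) {
        Thread child = new Thread(() -> {
            try {
                nativeSetup.run(); // hypothetical stand-in for native thread setup
            } catch (Throwable t) {
                publish(State.FAILED); // report the failure instead of exiting silently
                return;
            }
            publish(State.RUNNING);
            userCode.run();
        });
        child.start();
        synchronized (lock) {
            try {
                while (state == State.PENDING) {
                    lock.wait(); // loop guards against spurious wakeups
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            return state == State.RUNNING;
        }
    }
}
```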

> To fix this bug I think it is necessary to get rid of error conditions
> in the newly created threads. I think it is necessary to allocate all
> necessary state before a new thread is started, so if these resources
> cannot be allocated, an error should be returned to the parent thread,
> and it won't wait infinitely on new thread start notification.
Well, this would not resolve all possible reasons for the infinite
wait, as noted above. Besides, it creates another kind of memory
waste/leak if a thread is never run. My proposal is the opposite: do
all native allocations no earlier than when start() is invoked, throw
OOME naturally if any of them fails, and rely on the OS scheduler to
run the new thread concurrently.
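
A minimal sketch of this proposal, in plain Java rather than the actual TM
code. EagerFailStartSketch and NativeAllocator are hypothetical names, and
allocate/release only stand in for the native per-thread allocations.
Allocation happens inside start(), an OOME goes straight to the caller,
and start() returns without waiting for run() to begin.

```java
class EagerFailStartSketch {
    // Hypothetical stand-in for the thread manager's native allocations.
    interface NativeAllocator {
        Object allocate();         // may throw OutOfMemoryError
        void release(Object data);
    }

    static Thread start(NativeAllocator allocator, Runnable body) {
        // Allocate only now, when the thread is really being started.
        // A failure here propagates to the caller of start().
        Object data = allocator.allocate();
        Thread t = new Thread(() -> {
            try {
                body.run();
            } finally {
                allocator.release(data); // free per-thread data when the thread ends
            }
        });
        try {
            t.start(); // no handshake: the OS scheduler decides when run() begins
        } catch (Error | RuntimeException e) {
            allocator.release(data); // the thread never ran, so nothing else frees it
            throw e;
        }
        return t;
    }
}
```

Because the caller never waits for a notification from the child, a child
that dies early cannot leave the caller hanging.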

BTW, AFAIU in the current design the TM never frees native memory
allocated for new thread data, though it sometimes reuses it for
collected threads. HARMONY-2437 and HARMONY-2742 are related. I
don't think this is the cause of the MegaSpawn crashes, but it is
yet another issue of the same kind.

>
> --
> Gregory
>
>
