harmony-dev mailing list archives

From "Evgueni Brevnov" <evgueni.brev...@gmail.com>
Subject Re: [drlvm] stress.Mix / MegaSpawn threading bug
Date Thu, 11 Jan 2007 05:04:40 GMT
On 1/11/07, Rana Dasgupta <rdasgupt@gmail.com> wrote:
> On 1/10/07, Geir Magnusson Jr. <geir@pobox.com> wrote:
> >
> >
> > On Jan 10, 2007, at 2:13 PM, Weldon Washburn wrote:
> >
> > >> 1)
> > >> In some earlier posting, it was mentioned that somehow the virtual
> > >> memory
> > >> address space is impacted by how much physical memory is in a given
> > >> computer.  Actually this is not true.  The virtual address space
> > >> available
> > >> to the JVM is fixed by the OS.  A machine with less phys mem will
> > >> do more
> > >> disk I/O.   In other words "C" malloc() hard limits are set by OS
> > >> version
> > >> number not by RAM chips.
> > >>
> >
> > >Talking about VM vs RAM vs whatever is a red herring - we may be
> > >ported to a machine w/o virtual memory.  What matters is that when
> > >malloc() returns null, we do something smart.  At least, do nothing
> > >harmful.
>
>
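To make the "when malloc() returns null, do something smart" point concrete, here is a minimal C sketch. The names (vm_status, thread_record, thread_record_create, failing_alloc) are hypothetical and not taken from DRLVM; the allocator is passed in only so that the failure path can be exercised.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes, not DRLVM's actual names. */
enum vm_status { VM_OK = 0, VM_OUT_OF_MEMORY = 1 };

/* Hypothetical per-thread bookkeeping record. */
struct thread_record {
    int id;
    char name[32];
};

/* The allocator is a parameter so the failure path can be exercised. */
typedef void *(*alloc_fn)(size_t);

/* Allocate and fill a record; on failure, report it instead of crashing. */
static enum vm_status thread_record_create(alloc_fn alloc, int id,
                                           struct thread_record **out)
{
    assert(alloc != NULL && out != NULL);
    struct thread_record *rec = alloc(sizeof *rec);
    if (rec == NULL) {
        *out = NULL;             /* never hand back a garbage pointer */
        return VM_OUT_OF_MEMORY; /* caller decides: raise OOME, retry, ... */
    }
    rec->id = id;
    snprintf(rec->name, sizeof rec->name, "thread-%d", id);
    *out = rec;
    return VM_OK;
}

/* Stand-in allocator that always fails, simulating exhaustion. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}
```

The only point of the sketch is that the null check happens at the allocation site and the failure is propagated as a status, so the caller at least "does nothing harmful".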
> There can be no machine without virtual memory on any of the OSes of
> interest to us. VM is not a type of memory technology. What Weldon, Gregory
> and several others have pointed out is that if one keeps on consuming
> virtual address space by allocating space for thread stacks, the address
> space will eventually run out, and the process will be in a fatal state
> regardless of how much physical memory the machine has.
>
> >> 2)
> > >> Why not simply hard code DRLVM to throw an OOME whenever there are
> > >> more than
> > >> 1K threads running?  I think Rana first suggested this approach.
> > >> My guess
> > >> is that 1K threads is good enough to run lots of interesting
> > >> workloads.  My
> > >> guess is that common versions of WinXP and Linux will handle the C
> > >> malloc()
> > >> load of 1K threads successfully.  If not, how about trying 512
> > >> threads?
> >

-1 for hard-coding the maximum number of threads.

> > >Because this is picking up the rug, and sweeping all the dirt
> > >underneath it.  The core problem isn't that we try too many threads,
> > >but the code wasn't written defensively.  Putting an artificial limit
> > >on # of threads just means that we'll hit it somewhere else, in some
> > >other resource usage.
> >
> > >I think we should fix it.
>
>
> Sure. The way to fix a fatal error is to leave room for a process to recover
> from it or handle it. Another example of a fatal error is a stack overflow
> or a TerminateProcess signal. In the case of stack overflow, we handle it by
> trying to raise the exception while some room is left on the stack, so that
> there is a fair chance to handle it. Similarly, an approach could be to set a
> limit on the maximum number of threads we create. Based on the memory we
> give each thread stack, we can choose a limit which we estimate will leave us
> room to handle the error.
>
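The idea of deriving a thread limit from the per-thread stack reservation, rather than hard-coding it, can be sketched in C as follows. The function names and all numbers are illustrative assumptions, not DRLVM values: a 2 GiB user address space, a 1 MiB stack reservation per thread, and 256 MiB of headroom kept free for error handling.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on the number of threads such that `headroom` bytes of
 * address space stay unreserved for error handling. All inputs are
 * illustrative assumptions supplied by the caller.
 */
static uint64_t thread_limit(uint64_t address_space, uint64_t stack_reserve,
                             uint64_t headroom)
{
    assert(stack_reserve > 0);
    if (headroom >= address_space)
        return 0; /* nothing left over for thread stacks */
    return (address_space - headroom) / stack_reserve;
}

/* Hypothetical admission check to run before creating a thread. */
static int may_create_thread(uint64_t live_threads, uint64_t limit)
{
    return live_threads < limit;
}
```

With the assumed numbers the limit is (2048 MiB - 256 MiB) / 1 MiB = 1792 threads; halving the stack reservation to 512 KiB doubles it to 3584. Physical memory does not appear anywhere in the calculation.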
> >>There seem to be some basic things we can do, like reduce the stack
> > >>size on windows from the terabyte or whatever it is now, to the
> > >>number that our dear, esteemed colleague from IBM claims is perfectly
> > >>suitable for production use.
> >
> > >That too doesn't solve the problem, but it certainly fixes a problem
> > >we are now aware of - our stack size is too big.... :)
>
>
> The best size to set for the thread stack is a valid issue, and it is useful
> information to know what the IBM VM sets. Google searches also seem to show
> that thread stack size on J9 is user configurable. But even with smaller
> stack sizes, if one ran MegaSpawn for a sufficiently long time, we would get
> the same error. So either we can't have unbounded stress tests like this, or
> the VM needs to bound the resources consumable by such a test. Also, we cannot
> just emulate what the IBM VM does in one specific area without understanding
> their entire design. For example, a small stack size will cause stack
> overflow exceptions to happen early. We need to tune these sizes based on
> our own experiments.
>
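For experimenting with stack sizes on POSIX systems, a minimal C sketch using pthread_attr_setstacksize() might look like the following. The helper names (spawn_with_stack, mark_done) are hypothetical, and this is not how DRLVM creates threads. Note that pthread_create() reports resource exhaustion through its return code (for example EAGAIN), which is what the caller would have to handle.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Trivial thread body: sets the int it is given to 1. */
static void *mark_done(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/*
 * Start a thread with an explicit stack size. Returns 0 on success, or the
 * error code of the pthread call that failed (for example EAGAIN when the
 * system lacks the resources to create another thread).
 */
static int spawn_with_stack(pthread_t *thread, size_t stack_size,
                            void *(*body)(void *), void *arg)
{
    assert(thread != NULL && body != NULL);
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(&attr, stack_size);
    if (rc == 0)
        rc = pthread_create(thread, &attr, body, arg);
    pthread_attr_destroy(&attr);
    return rc;
}
```

A caller would pass, say, 256 * 1024 as the stack size, check the return value, and join the thread as usual. Which sizes are safe still has to be found by experiment, since a stack that is too small trades the address-space problem for early stack overflows.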
> >>
> > >> 3)
> > >> The above does not deal with the general architecture question of
> > >> handling C
> > >> malloc failures.  This is far harder to solve.  Note that solving
> > >> the big
> > >> question will also require far more extensive regression tests than
> > >> MegaSpawn.  However, it does fix DRLVM so that it does not crash/
> > >> burn on
> > >> threads overload.  This, in turn, gives us time to fix the real
> > >> underlying
> > >> problem(s) with C malloc.
>
>
> I think that we should defer this part; it is a difficult problem and there
> are several potential approaches based on what kind of reliable computing
> contracts we want to expose. For example, one can think of a contract
> that no fatal failures (OOME, stack overflow, thread abort) happen
> in marked regions of code, ever. I don't think that we need to solve this
> hard problem right now.
>
>
