harmony-dev mailing list archives

From Egor Pasko <egor.pa...@gmail.com>
Subject Re: [drlvm][threading] H3010 (Stack Overflow Exception) -- when does this bug really have to be fixed?
Date Tue, 13 Mar 2007 10:44:25 GMT
On the 0x297 day of Apache Harmony Weldon Washburn wrote:
> On 12 Mar 2007 21:52:45 +0300, Egor Pasko <egor.pasko@gmail.com> wrote:
> >
> > On the 0x297 day of Apache Harmony Weldon Washburn wrote:
> > > On 12 Mar 2007 19:46:06 +0300, Egor Pasko <egor.pasko@gmail.com> wrote:
> > > >
> > > > On the 0x297 day of Apache Harmony Weldon Washburn wrote:
> > > > > All,
> > > > > I assigned H3010 to myself.  This test definitely demonstrates a bug
> > > > > that needs fixing.  But it's not clear when this bug must be fixed.
> > > > > This really brings forward a higher-level question: what should we
> > > > > code this bug as right now, and when would it be moved to "blocker"
> > > > > status?  I provide some observations to start the discussion:
> > > > >
> > > > > 1)
> > > > > The bug is a Stack Overflow Exception that happens inside fast native
> > > > > helper functions.  Fast native helpers do not set up the M2N stack
> > > > > frame, which is required to throw exceptions such as SOE.  Adding M2N
> > > > > setup to fast native helpers would unacceptably slow down the system.
> > > >
> > > > to be honest..
> > > >
> > > > SOE can happen from a 'push' onto stack (such pushes are not
> > > > safepoints in JIT currently). Thus, you cannot unwind properly (no M2N
> > > > necessary for releasing the lock).
> > > >
> > > > Do you think it is a low probability?
> > >
> > >
> > > Good point.  Yes, SOE can happen from jitted code doing stuff like "push
> > > ebp".  And we have to handle this case properly.  And it will require a
> > > design discussion between JIT and VM developers.  This is a really
> > > interesting topic.  But the question remains.  Do we have to solve this
> > > issue in Q1?  Q4?  2008??  To answer this question, we have to ask what
> > > workloads we want to run in Q1/Q2/Q3...  And then find out if the
> > > workloads hit the SOE problem we are discussing.  My guess is that if
> > > useful workloads we want to run actually hit SOE, we will be able to work
> > > around it by simply making the stack a little bigger.  Also my guess is
> > > that Java compatibility tests (tck?) will specifically test this case.
> > > In other words, it's probably needed for compliance but not really needed
> > > for getting important workloads running.
> >
> > that has some relevance to the -Xss option. If we implement it, almost
> > any "popular workload" would crash with SEGV instead of throwing SOE
> > properly when run on a small stack size.
> >
> > One might argue that running a "popular workload" with a small stack
> > size makes the workload "not so popular". I dunno.
> 
> 
> I understand your argument.  It makes perfect sense.  But the question
> remains.  Is this a bug that has to be fixed in Q2 or in 2008?  Is it
> acceptable to simply bump up the stack size to get Q2 workloads running?

Weldon, I have to agree with you: according to our Q2_milestone draft,
this is not a Q2/07 item. But I would address this issue no later than
Q4/2007 because it affects important design decisions that should be
implemented _before_ any attempt to pass serious conformance tests.
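
To make the stack-size point concrete, here is a minimal sketch of the
kind of check a conformance test might make. It is not DRLVM code and
not the H3010 test; the class and method names are made up for
illustration, and it uses current Java syntax. It assumes the VM honors
the per-thread stack size hint, which the Java API allows a VM to round
or ignore.

```java
final class StackDepthProbe {
    // Recurse until the VM reports stack overflow; counter[0] records the depth reached.
    private static void recurse(int[] counter) {
        counter[0]++;
        recurse(counter);
    }

    // Run the recursion on a fresh thread created with the requested stack size
    // and return how many calls were made before StackOverflowError was thrown.
    // Assumption: the stackSize argument is only a hint; a VM may round or ignore it.
    static int maxDepth(long stackBytes) {
        final int[] counter = new int[1];
        Thread probe = new Thread(null, () -> {
            try {
                recurse(counter);
            } catch (StackOverflowError expected) {
                // A VM that handles SOE properly unwinds to here instead of crashing.
            }
        }, "stack-probe", stackBytes);
        probe.start();
        try {
            probe.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while probing", e);
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println("256 KB stack: depth " + maxDepth(256L * 1024));
        System.out.println("16 MB stack:  depth " + maxDepth(16L * 1024 * 1024));
    }
}
```

A bigger stack only moves the overflow point further out. The program
still depends on the VM turning the overflow into a catchable error,
which is the part under discussion here.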

Now I tend to think that milestoning is not a perfect thing. There is
a pretty cool agile approach that might be more effective for us (and
is probably what Geir is talking about):
1. regularly update the TODO list
2. prioritize it, agree on priorities project-wide
3. assign highest priority tasks to people (keep the list of task-to-person on Wiki)
4. set requirements for the next release (according to the list of assignments)
5. when these tasks are finished and integrated we fire up the release
6. goto 1.

Maybe a new thread for this?

> > > > > 2)
> > > > > When running a useful workload, a Stack Overflow that hits precisely
> > > > > on a fast native has a very low probability.  Note the test in H3010
> > > > > specifically forces this event to happen with a very high
> > > > > probability.  In other words, while the test is a good one, it
> > > > > reflects a very rare event in nature.
> > > > >
> > > > > Given the above, how about we address fixing the problem in two stages:
> > > > >
> > > > > 1)
> > > > > First stage: add an "assert(zero);" to the exception handler when it
> > > > > is determined an SOE has happened inside a fast native.  This way, we
> > > > > will find out quickly when an important workload hits this bug.  Once
> > > > > the assert(zero) is added, we code H3010 as "later".
> > > > >
> > > > > 2)
> > > > > Second stage: When an application we care about hits the
> > > > > assert(zero), we recode H3010 as "major/blocker".
> > > > >
> > > > > 3)
> > > > > While waiting for #2 above to happen, we discuss on harmony-dev ways
> > > > > of designing the right fix.  For starters, I think we should
> > > > > investigate a design where the exception handler rewrites the entire
> > > > > register context so that returning from the exception handler
> > > > > revectors the instruction pointer to recovery code that will somehow
> > > > > push the M2N frame on the stack and call the proper SOE throwing code.
> > > > > I have not looked closely at how to do this.  I am not convinced this
> > > > > approach will work.  However, I do think it's worth a try.  Thoughts?
> > > >
> > > > --
> > > > Egor Pasko
> > > >
> > > >
> > >
> > >
> > > --
> > > Weldon Washburn
> > > Intel Enterprise Solutions Software Division
> >
> > --
> > Egor Pasko
> >
> >
> 
> 
> -- 
> Weldon Washburn
> Intel Enterprise Solutions Software Division

-- 
Egor Pasko

