harmony-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "weldon washburn (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HARMONY-4215) [drlvm][reliability] assert in JIT while unwinding stack (during enumeration)
Date Sat, 15 Sep 2007 00:16:32 GMT

    [ https://issues.apache.org/jira/browse/HARMONY-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527662
] 

weldon washburn commented on HARMONY-4215:
------------------------------------------

Evgeuni, 
Thanks for the analysis.  It is good work.  A few of your discoveries are new to me.  And
a few of the below items actually hit on the key thread state design issues documented in
Harmony-4716.  Inline comments are below.

I want to share with you my findings during HARMONY-4215 investigation. In total I found 8
issues in TM component. By fixing/workarounding these issues I managed to pass HARMONY-4215.
I'm not sure if all of them are unknown. 
  

1) Problem state: New threads may appear and switch to suspend disabled state (so performing
dangerous operations) during GC cycle. That may happen because hythread_suspend_all() releases
the global lock which guards attachment of new threads. 

[Washburn, Weldon] I did not realize this.  We have to find out if this will cause deadlock.
 This looks like it should be part of the Thread State Transition project.


Possible solution: Keep global lock until hythread_resume_all() is called. 

  

2) Problem state: Race condition in vm_attach(). Last action of jthread_allocate_vm_thread
is to associate native with vm_thread. From that moment this thread becomes available for
GC enumeration. But by that time it's not yet fully initialized. For example M2N frame is
not allocated and initialized. Possible solution: Native and VM thread association should
be last action with atomic effect. 

[Washburn, Weldon] I think we just fixed this bug in a recent Thread Lifecycle commit.
 

3) Problem state: Race condition in hythread_suspend_disable(). This one was mentioned by
Weldon on the list. But only one side of the problem was discussed. To guarantee correct work
of suspension machinery two conditions should be met. The first one is that suspender thread
MUST observe exactly the same value of disable_count as it is assigned before suspendee thread
reads suspend_request flag. The second one is that suspendee thread MUST increment disable_count
before it reads suspend_request flag. 

Possible solution: Add memory read/write barrier between write operation to disable_count
and read operation from suspend_request. 

[Washburn, Weldon]  This is a major concern.  My current thoughts are covered in JIRA H4716.
 It looks like this one should be part of the Thread State Transition project.  Briefly there
are two big concerns.  One, does SFENCE opcode make a guarantee that a write will appear in
the coherency domain before the next instruction is retired?  I think the answer is yes but
the Intel documentation is vague.  Two, what is the performance impact of an SFENCE?  We tried
to model exactly the above with lockcmpxchg but found the performance unacceptable.  Also
note the proposed patch uses apr_memory_rw_barrier() which apparently is defined to be   
__asm {lock add [esp], 0 }   In other words, ...rw_barrier does not use SFENCE.  My guess
is that while a lockcmpxchg is a superset of SFENCE, it has a performance penalty.


4) Problem state: Potentially illegal transition from suspend disabled to suspend enabled
mode in hythread_global_lock. When some thread is trying to acquire TM's global lock in disabled
state it will be quietly switched in enabled state if GC wants to happen. But such transition
is illegal if thread already performed "unsafe" operations and is not ready for GC to happen.
This issue affects overall stability in a negative way and can be hard to identify. 

Possible solution: I think the best choice here to follow the rule that TM's global lock should
be grabbed before switching to disabled state. 

[Washburn, Weldon] hmm... Maybe.  There could be deadlock and performance impact.  Note that
code outside the Thread Manager that misuses Thread Manager interfaces, while absolutely undesirable,
is not really a Thread Manager internals problem.  In other words, this is definitely a serious
problem that needs to be addressed seperately from the Thread Manager problems.

  

5) Problem state: Race condition in hythread_global_lock. At the very end of this function
disable_count is assigned to its original value. Another thread may send suspend request to
the current and check disable_count (which is zero) just after current thread have read suspend_request
(read value is zero) flag and changed disable_count. Thus suspender thread thinks that it
successfully suspended target thread and target thread in the same time enters unsafe region.


Possible solution: Use hythread_set_suspend_disable to change the value of disable_count which
takes care about described situation. Unfortunately, this solution is not perfect because
it opens small window when deadlock can happened. This is exactly the same type of deadlock
which current procedure is trying to avoid. So the best solution will be the same as described
in item 4. 

[Washburn, Weldon]   This looks like it should be part of the Thread State Transition project.


6) Problem state: Race condition in thread_safe_point_impl(). The problem is exactly the same
as described in item 5. The very last assignment to disable_count can be noted by suspender
thread to late and the thread will enter unsafe region while suspender thinks it's ok to procced.


Possible solution: Use hythread_set_suspend_disable to change the value of disable_count.


[Washburn, Weldon]  Yes.  This is a key problem that Thread State Transition project intends
to address.  Please see H4716.
  

7) Problem state: Race condition in hythread_set_suspend_disable. See item 2 for description
and solution. 

8) Problem state: Minor concern regarding thread attach/detach procedure. It seems to be reasonable
from stability POV to hold TM's global lock during the whole attach/detach period. That can
help to avoid nasty race conditions like mentioned in item 2. So in the attached patch the
problem 2. is fixed/workarounded this way. 

[Washburn, Weldon] hmm.... I think the code already does what is suggested.  I have created
a patch for vm shutdown.  Let's wait and see what bug(s) remain once we get Thread Lifecycle
and vm shutdown patches committed.
  

I didn't test attached patch thoroughly. I've only checked that it fixes HARMONY-4215. I also
didn't check performance impact.      Washburn, Weldon] OK.  

So don't take it as final rather just as a demonstration of problems. 

[Washburn, Weldon] Good idea.  Your work has been very helpful.  Hopefully we fix these bugs
during Q4 when the Thread State Transition project get going.




> [drlvm][reliability] assert in JIT while unwinding stack (during enumeration)
> -----------------------------------------------------------------------------
>
>                 Key: HARMONY-4215
>                 URL: https://issues.apache.org/jira/browse/HARMONY-4215
>             Project: Harmony
>          Issue Type: Bug
>          Components: DRLVM
>         Environment: Win 2003
>            Reporter: Aleksey Ignatenko
>            Assignee: weldon washburn
>         Attachments: 2rel_tests_fail.dmp, H4215_tm_races.patch, test.zip, util.Date_int.out.err
>
>
> 2 reliablity tests failed on assertion in jit:
> 1. CharsetSyncCacheTest
> 2. ZlibTest
> svn rev 548320
> Call stack:
>  	jitrino.dll!_assert(const char * expr=0x01ba9f08, const char * filename=0x01ba9eb0,
unsigned int lineno=0x000000ac)  Line 295	C
> >	jitrino.dll!Jitrino::Ia32::StackInfo::read(Jitrino::MethodDesc * pMethodDesc=0x03bef720,
unsigned int eip=0x00000000, bool isFirst=false)  Line 172 + 0x1a	C++
>  	jitrino.dll!Jitrino::Ia32::RuntimeInterface::unwindStack(Jitrino::MethodDesc * methodDesc=0x03bef720,
JitFrameContext * context=0x02de90d4, bool isFirst=false)  Line 40	C++
>  	jitrino.dll!Jitrino::Jitrino::UnwindStack(Jitrino::MethodDesc * methodDesc=0x03bef720,
JitFrameContext * context=0x02de90d4, bool isFirst=false)  Line 280 + 0x1e	C++
>  	jitrino.dll!JIT_unwind_stack_frame(void * jit=0x012a6a48, Method * method=0x02cb3668,
JitFrameContext * context=0x02de90d4)  Line 362 + 0x18	C++
>  	harmonyvm.dll!Dll_JIT::unwind_stack_frame(Method * method=0x02cb3668, JitFrameContext
* context=0x02de90d4)  Line 94 + 0x14	C++
>  	harmonyvm.dll!si_goto_previous(StackIterator * si=0x02de90d0, bool over_popped=false)
 Line 315 + 0x32	C++
>  	harmonyvm.dll!vm_enumerate_root_set_single_thread_on_stack(StackIterator * si=0x02de90d0)
 Line 339 + 0xb	C++
>  	harmonyvm.dll!vm_enumerate_thread(VM_thread * thread=0x02d7e088)  Line 224 + 0x9	C++
>  	harmonyvm.dll!stop_the_world_root_set_enumeration()  Line 110 + 0xc	C++
>  	harmonyvm.dll!vm_enumerate_root_set_all_threads()  Line 150	C++
>  	gc_gen.dll!gc_reclaim_heap(GC * gc=0x014d5de8, unsigned int gc_cause=0x00000003)  Line
295 + 0x8	C++
>  	gc_gen.dll!gc_force_gc()  Line 138 + 0xd	C++
>  	harmonyvm.dll!Java_java_lang_VMMemoryManager_runGC(JNIEnv_External * __formal=0x02d9e240,
JNIEnv_External * __formal=0x02d9e240)  Line 138 + 0x8	C++
>  	03bf33ef()	
>  	harmonyvm.dll!get_vm_thread(HyThread * thr=0x03bef9ec)  Line 193 + 0xb	C++
>  	harmonyvm.dll!get_vm_thread(HyThread * thr=0x00000001)  Line 194 + 0x7	C++
> To reproduce onee needs to run these tests with 30 min cycle (set     <property name="test.duration"
value="1800" />  parameter in build.xml of reliablity tests).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message