Mailing-List: contact commits-help@harmony.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@harmony.apache.org
Message-ID: <29995560.1189815392126.JavaMail.jira@brutus>
Date: Fri, 14 Sep 2007 17:16:32 -0700 (PDT)
From: "weldon washburn (JIRA)" <jira@apache.org>
To: commits@harmony.apache.org
Subject: [jira] Commented: (HARMONY-4215) [drlvm][reliability] assert in JIT
 while unwinding stack (during enumeration)
In-Reply-To: <2468776.1182168926030.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HARMONY-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527662 ] 

weldon washburn commented on HARMONY-4215:
------------------------------------------

Evgeuni, 
Thanks for the analysis.  It is good work.  A few of your discoveries are new to me.  And a few of the below items actually hit on the key thread state design issues documented in Harmony-4716.  Inline comments are below.

I want to share with you my findings during HARMONY-4215 investigation. In total I found 8 issues in TM component. By fixing/workarounding these issues I managed to pass HARMONY-4215. I'm not sure if all of them are unknown. 
  

1) Problem state: New threads may appear and switch to suspend disabled state (so performing dangerous operations) during GC cycle. That may happen because hythread_suspend_all() releases the global lock which guards attachment of new threads. 

[Washburn, Weldon] I did not realize this.  We have to find out if this will cause deadlock.  This looks like it should be part of the Thread State Transition project.


Possible solution: Keep global lock until hythread_resume_all() is called. 

  
2) Problem state: Race condition in vm_attach(). Last action of jthread_allocate_vm_thread is to associate native with vm_thread. From that moment this thread becomes available for GC enumeration. But by that time it's not yet fully initialized. For example M2N frame is not allocated and initialized. Possible solution: Native and VM thread association should be last action with atomic effect. 

[Washburn, Weldon] I think we just fixed this bug in a recent Thread Lifecycle commit.
 

3) Problem state: Race condition in hythread_suspend_disable(). This one was mentioned by Weldon on the list. But only one side of the problem was discussed. To guarantee correct work of suspension machinery two conditions should be met. The first one is that suspender thread MUST observe exactly the same value of disable_count as it is assigned before suspendee thread reads suspend_request flag. The second one is that suspendee thread MUST increment disable_count before it reads suspend_request flag. 

Possible solution: Add memory read/write barrier between write operation to disable_count and read operation from suspend_request. 

[Washburn, Weldon]  This is a major concern.  My current thoughts are covered in JIRA H4716.  It looks like this one should be part of the Thread State Transition project.  Briefly there are two big concerns.  One, does SFENCE opcode make a guarantee that a write will appear in the coherency domain before the next instruction is retired?  I think the answer is yes but the Intel documentation is vague.  Two, what is the performance impact of an SFENCE?  We tried to model exactly the above with lockcmpxchg but found the performance unacceptable.  Also note the proposed patch uses apr_memory_rw_barrier() which apparently is defined to be    __asm {lock add [esp], 0 }   In other words, ...rw_barrier does not use SFENCE.  My guess is that while a lockcmpxchg is a superset of SFENCE, it has a performance penalty.


4) Problem state: Potentially illegal transition from suspend disabled to suspend enabled mode in hythread_global_lock. When some thread is trying to acquire TM's global lock in disabled state it will be quietly switched in enabled state if GC wants to happen. But such transition is illegal if thread already performed "unsafe" operations and is not ready for GC to happen. This issue affects overall stability in a negative way and can be hard to identify. 

Possible solution: I think the best choice here to follow the rule that TM's global lock should be grabbed before switching to disabled state. 

[Washburn, Weldon] hmm... Maybe.  There could be deadlock and performance impact.  Note that code outside the Thread Manager that misuses Thread Manager interfaces, while absolutely undesirable, is not really a Thread Manager internals problem.  In other words, this is definitely a serious problem that needs to be addressed seperately from the Thread Manager problems.

  
5) Problem state: Race condition in hythread_global_lock. At the very end of this function disable_count is assigned to its original value. Another thread may send suspend request to the current and check disable_count (which is zero) just after current thread have read suspend_request (read value is zero) flag and changed disable_count. Thus suspender thread thinks that it successfully suspended target thread and target thread in the same time enters unsafe region. 

Possible solution: Use hythread_set_suspend_disable to change the value of disable_count which takes care about described situation. Unfortunately, this solution is not perfect because it opens small window when deadlock can happened. This is exactly the same type of deadlock which current procedure is trying to avoid. So the best solution will be the same as described in item 4. 

[Washburn, Weldon]   This looks like it should be part of the Thread State Transition project.


6) Problem state: Race condition in thread_safe_point_impl(). The problem is exactly the same as described in item 5. The very last assignment to disable_count can be noted by suspender thread to late and the thread will enter unsafe region while suspender thinks it's ok to procced. 

Possible solution: Use hythread_set_suspend_disable to change the value of disable_count. 

[Washburn, Weldon]  Yes.  This is a key problem that Thread State Transition project intends to address.  Please see H4716.
  

7) Problem state: Race condition in hythread_set_suspend_disable. See item 2 for description and solution. 

8) Problem state: Minor concern regarding thread attach/detach procedure. It seems to be reasonable from stability POV to hold TM's global lock during the whole attach/detach period. That can help to avoid nasty race conditions like mentioned in item 2. So in the attached patch the problem 2. is fixed/workarounded this way. 

[Washburn, Weldon] hmm.... I think the code already does what is suggested.  I have created a patch for vm shutdown.  Let's wait and see what bug(s) remain once we get Thread Lifecycle and vm shutdown patches committed.
  

I didn't test attached patch thoroughly. I've only checked that it fixes HARMONY-4215. I also didn't check performance impact.      Washburn, Weldon] OK.  

So don't take it as final rather just as a demonstration of problems. 

[Washburn, Weldon] Good idea.  Your work has been very helpful.  Hopefully we fix these bugs during Q4 when the Thread State Transition project get going.


> [drlvm][reliability] assert in JIT while unwinding stack (during enumeration)
> -----------------------------------------------------------------------------
>
>                 Key: HARMONY-4215
>                 URL: https://issues.apache.org/jira/browse/HARMONY-4215
>             Project: Harmony
>          Issue Type: Bug
>          Components: DRLVM
>         Environment: Win 2003
>            Reporter: Aleksey Ignatenko
>            Assignee: weldon washburn
>         Attachments: 2rel_tests_fail.dmp, H4215_tm_races.patch, test.zip, util.Date_int.out.err
>
>
> 2 reliablity tests failed on assertion in jit:
> 1. CharsetSyncCacheTest
> 2. ZlibTest
> svn rev 548320
> Call stack:
>  	jitrino.dll!_assert(const char * expr=0x01ba9f08, const char * filename=0x01ba9eb0, unsigned int lineno=0x000000ac)  Line 295	C
> >	jitrino.dll!Jitrino::Ia32::StackInfo::read(Jitrino::MethodDesc * pMethodDesc=0x03bef720, unsigned int eip=0x00000000, bool isFirst=false)  Line 172 + 0x1a	C++
>  	jitrino.dll!Jitrino::Ia32::RuntimeInterface::unwindStack(Jitrino::MethodDesc * methodDesc=0x03bef720, JitFrameContext * context=0x02de90d4, bool isFirst=false)  Line 40	C++
>  	jitrino.dll!Jitrino::Jitrino::UnwindStack(Jitrino::MethodDesc * methodDesc=0x03bef720, JitFrameContext * context=0x02de90d4, bool isFirst=false)  Line 280 + 0x1e	C++
>  	jitrino.dll!JIT_unwind_stack_frame(void * jit=0x012a6a48, Method * method=0x02cb3668, JitFrameContext * context=0x02de90d4)  Line 362 + 0x18	C++
>  	harmonyvm.dll!Dll_JIT::unwind_stack_frame(Method * method=0x02cb3668, JitFrameContext * context=0x02de90d4)  Line 94 + 0x14	C++
>  	harmonyvm.dll!si_goto_previous(StackIterator * si=0x02de90d0, bool over_popped=false)  Line 315 + 0x32	C++
>  	harmonyvm.dll!vm_enumerate_root_set_single_thread_on_stack(StackIterator * si=0x02de90d0)  Line 339 + 0xb	C++
>  	harmonyvm.dll!vm_enumerate_thread(VM_thread * thread=0x02d7e088)  Line 224 + 0x9	C++
>  	harmonyvm.dll!stop_the_world_root_set_enumeration()  Line 110 + 0xc	C++
>  	harmonyvm.dll!vm_enumerate_root_set_all_threads()  Line 150	C++
>  	gc_gen.dll!gc_reclaim_heap(GC * gc=0x014d5de8, unsigned int gc_cause=0x00000003)  Line 295 + 0x8	C++
>  	gc_gen.dll!gc_force_gc()  Line 138 + 0xd	C++
>  	harmonyvm.dll!Java_java_lang_VMMemoryManager_runGC(JNIEnv_External * __formal=0x02d9e240, JNIEnv_External * __formal=0x02d9e240)  Line 138 + 0x8	C++
>  	03bf33ef()	
>  	harmonyvm.dll!get_vm_thread(HyThread * thr=0x03bef9ec)  Line 193 + 0xb	C++
>  	harmonyvm.dll!get_vm_thread(HyThread * thr=0x00000001)  Line 194 + 0x7	C++
> To reproduce onee needs to run these tests with 30 min cycle (set     <property name="test.duration" value="1800" />  parameter in build.xml of reliablity tests).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.