hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Binglin Chang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-9916) Race condition in ipc.Client causes TestIPC timeout
Date Sun, 01 Sep 2013 17:24:52 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Binglin Chang updated HADOOP-9916:
----------------------------------

    Description: 
TestIPC timeouts occasionally, for example: 
[https://issues.apache.org/jira/browse/HDFS-5130?focusedCommentId=13749870&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13749870]
[https://issues.apache.org/jira/browse/HADOOP-9915?focusedCommentId=13753302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13753302]

Look into the code, there is race condition in oah.ipc.Client, the race condition happen between
RPC call thread and connection read response thread:

{code}
        if (status == RpcStatusProto.SUCCESS) {
          Writable value = ReflectionUtils.newInstance(valueClass, conf);
          value.readFields(in);                 // read value
          call.setRpcResponse(value);
          calls.remove(callId);
{code}

Read Thread: 
Connection.receiveRpcResponse-> call.setRpcResponse(value) -> notify Call Thread

Call Thread:
Client.call -> Connection.addCall(retry with the same callId) -> notify read thread

Read Thread:
calls.remove(callId) # intend to remove old call, but removes newly added call...
Connection.waitForWork end up wait maxIdleTime and close the connection. The call never get
response and dead.

The problem doesn't show because previously callId is unique, we never accidentally remove
newly added calls, but when retry added this race condition became possible.

To solve this, we can simply change order, remove the call first, then notify call thread.
Note there are many places need this order change(normal case, error case, cleanup case)

And there are some minor issues in TestIPC:
1. there are two method with same name:
   void testSerial()
   void testSerial(int handlerCount, boolean handlerSleep, ...)
   the second is not a test case(so should not use testXXX prefix), but somehow it causes
testSerial(first one) run two times, see test report:
{code}
  <testcase time="26.896" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
  <testcase time="25.426" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
{code}

2. timeout annotation should be added, so next time related log is available.



  was:
TestIPC timeouts occasionally, for example: 
[https://issues.apache.org/jira/browse/HDFS-5130?focusedCommentId=13749870&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13749870]
[https://issues.apache.org/jira/browse/HADOOP-9915?focusedCommentId=13753302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13753302]

Look into the code, have not found the cause yet, but we can do something first:

1. there are two method with same name:
   void testSerial()
   void testSerial(int handlerCount, boolean handlerSleep, ...)
   the second is not a test case, but somehow it causes testSerial(first one) run two times,
see test report:
{code}
  <testcase time="26.896" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
  <testcase time="25.426" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
{code}

2. timeout annotation should be added, so next time related log is available.



        Summary: Race condition in ipc.Client causes TestIPC timeout  (was: TestIPC timeouts
occasionally)
    
> Race condition in ipc.Client causes TestIPC timeout
> ---------------------------------------------------
>
>                 Key: HADOOP-9916
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9916
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>            Priority: Minor
>
> TestIPC timeouts occasionally, for example: 
> [https://issues.apache.org/jira/browse/HDFS-5130?focusedCommentId=13749870&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13749870]
> [https://issues.apache.org/jira/browse/HADOOP-9915?focusedCommentId=13753302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13753302]
> Look into the code, there is race condition in oah.ipc.Client, the race condition happen
between RPC call thread and connection read response thread:
> {code}
>         if (status == RpcStatusProto.SUCCESS) {
>           Writable value = ReflectionUtils.newInstance(valueClass, conf);
>           value.readFields(in);                 // read value
>           call.setRpcResponse(value);
>           calls.remove(callId);
> {code}
> Read Thread: 
> Connection.receiveRpcResponse-> call.setRpcResponse(value) -> notify Call Thread
> Call Thread:
> Client.call -> Connection.addCall(retry with the same callId) -> notify read thread
> Read Thread:
> calls.remove(callId) # intend to remove old call, but removes newly added call...
> Connection.waitForWork end up wait maxIdleTime and close the connection. The call never
get response and dead.
> The problem doesn't show because previously callId is unique, we never accidentally remove
newly added calls, but when retry added this race condition became possible.
> To solve this, we can simply change order, remove the call first, then notify call thread.
> Note there are many places need this order change(normal case, error case, cleanup case)
> And there are some minor issues in TestIPC:
> 1. there are two method with same name:
>    void testSerial()
>    void testSerial(int handlerCount, boolean handlerSleep, ...)
>    the second is not a test case(so should not use testXXX prefix), but somehow it causes
testSerial(first one) run two times, see test report:
> {code}
>   <testcase time="26.896" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
>   <testcase time="25.426" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
> {code}
> 2. timeout annotation should be added, so next time related log is available.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message