hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4659) Root cause of connection failure is being lost to code that uses it for delaying startup
Date Wed, 19 Nov 2008 16:43:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649087#action_12649087 ]

Steve Loughran commented on HADOOP-4659:
----------------------------------------

I'm going to put a merged patch up. Although the RPC test is passing, the spinning appears
to be creating a deadlock in TestFileCreationClient; the relevant bits of the thread dump follow.

1. We're sleeping here holding Connection@0x2e4f3e0

    [junit] "DataStreamer for file /wrwelkj/file9 block blk_-4298389317957709021_1010" id=133 idx=0x210 tid=25976 prio=5 alive, in native, sleeping, native_waiting, daemon
    [junit]     at java/lang/Thread.sleep(J)V(Native Method)
    [junit]     at org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:373)
    [junit]     at org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:310)
    [junit]     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x2e4f3e0[thin lock]
    [junit]     at org/apache/hadoop/ipc/Client$Connection.access$1700(Client.java:177)
    [junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:791)
    [junit]     at org/apache/hadoop/ipc/Client.call(Client.java:697)
    [junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
    [junit]     at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)

2. Which is blocking this
    [junit]     -- Blocked trying to get lock: org/apache/hadoop/ipc/Client$Connection@0x2e4f3e0[thin lock]
    [junit]     at jrockit/vm/Threads.sleep(I)V(Native Method)
    [junit]     at jrockit/vm/Locks.waitForThinRelease(Locks.java:1233)[optimized]
    [junit]     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1307)[optimized]
    [junit]     at jrockit/vm/Locks.monitorEnter(Locks.java:2389)[optimized]
    [junit]     at org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:219)
    [junit]     at org/apache/hadoop/ipc/Client$Connection.access$1600(Client.java:177)
    [junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:785)
    [junit]     at org/apache/hadoop/ipc/Client.call(Client.java:697)
    [junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
    [junit]     at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)
    [junit]     at org/apache/hadoop/hdfs/DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:141)
    [junit]     at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2469)
    [junit]     at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.access$1700(DFSClient.java:1997)
    [junit]     at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)

and this

    [junit]     -- Blocked trying to get lock: org/apache/hadoop/ipc/Client$Connection@0x2e4f3e0[thin lock]
    [junit]     at jrockit/vm/Threads.sleep(I)V(Native Method)
    [junit]     at jrockit/vm/Locks.waitForThinRelease(Locks.java:1233)[optimized]
    [junit]     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1307)[optimized]
    [junit]     at jrockit/vm/Locks.monitorEnter(Locks.java:2389)[optimized]
    [junit]     at org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:219)
    [junit]     at org/apache/hadoop/ipc/Client$Connection.access$1600(Client.java:177)
    [junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:785)
    [junit]     at org/apache/hadoop/ipc/Client.call(Client.java:697)
    [junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
    [junit]     at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)
    [junit]     at org/apache/hadoop/hdfs/DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:141)
    [junit]     at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2469)
    [junit]     at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.access$1700(DFSClient.java:1997)
    [junit]     at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
    [junit]     ^-- Holding lock: java/util/LinkedList@0x1eb5e20[fat lock]
    [junit]     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
    [junit]     -- end of trace

and this

    [junit] "DataStreamer for file /wrwelkj/file5 block blk_7479178383257153500_1010" id=127 idx=0x200 tid=25971 prio=5 alive, in native, blocked, daemon
    [junit]     -- Blocked trying to get lock: org/apache/hadoop/ipc/Client$Connection@0x2e4f3e0[thin lock]
    [junit]     at jrockit/vm/Threads.sleep(I)V(Native Method)
    [junit]     at jrockit/vm/Locks.waitForThinRelease(Locks.java:1233)[optimized]
    [junit]     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1307)[optimized]
    [junit]     at jrockit/vm/Locks.monitorEnter(Locks.java:2389)[optimized]
    [junit]     at org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:219)
    [junit]     at org/apache/hadoop/ipc/Client$Connection.access$1600(Client.java:177)
    [junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:785)
    [junit]     at org/apache/hadoop/ipc/Client.call(Client.java:697)
    [junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
    [junit]     at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)
    [junit]     at org/apache/hadoop/hdfs/DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:141)
    [junit]     at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2469)
    [junit]     at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.access$1700(DFSClient.java:1997)
    [junit]     at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
    [junit]     ^-- Holding lock: java/util/LinkedList@0x1ea6858[fat lock]
    [junit]     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)

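So: the sleep in handleConnectionFailure, called from setupIOstreams while it holds the Connection
lock, appears to be blocking the other threads, which are all waiting in addCall for that same lock.
For some reason, <junit> isn't timing out or killing the process, which implies this is fairly serious.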
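
To illustrate the pattern in the dump, here is a minimal, self-contained sketch. It is not the Hadoop code and not the patch: the class SleepUnderLockDemo and all of its method names are hypothetical stand-ins. It relies only on the documented behaviour that Thread.sleep does not release any monitors the sleeping thread owns, so a retry pause taken inside a synchronized block stalls every other thread that needs the same monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the locking pattern seen in the thread dump.
final class SleepUnderLockDemo {
    // Plays the role of the Client$Connection monitor.
    private final Object connection = new Object();

    // Retry pause taken while still holding the monitor (the pattern in the dump).
    void pauseHoldingLock(long millis, CountDownLatch held) throws InterruptedException {
        synchronized (connection) {
            held.countDown();       // signal that the monitor is now owned
            Thread.sleep(millis);   // sleep does NOT release the monitor
        }
    }

    // Same pause, but taken only after the monitor has been released.
    void pauseOutsideLock(long millis, CountDownLatch held) throws InterruptedException {
        synchronized (connection) {
            held.countDown();       // do the bookkeeping under the lock
        }
        Thread.sleep(millis);       // wait without blocking other threads
    }

    // Plays the role of addCall: it needs the same monitor. Returns the wait in ms.
    long timeToAcquire() {
        long start = System.nanoTime();
        synchronized (connection) {
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        }
    }

    // Starts a sleeper thread, then measures how long a second thread waits for the monitor.
    static long measure(boolean holdLock, long sleepMillis) throws InterruptedException {
        SleepUnderLockDemo demo = new SleepUnderLockDemo();
        CountDownLatch held = new CountDownLatch(1);
        Thread sleeper = new Thread(() -> {
            try {
                if (holdLock) {
                    demo.pauseHoldingLock(sleepMillis, held);
                } else {
                    demo.pauseOutsideLock(sleepMillis, held);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sleeper.setDaemon(true);
        sleeper.start();
        held.await();                        // wait until the sleeper has taken the monitor
        long waited = demo.timeToAcquire();
        sleeper.join();
        return waited;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sleep inside lock:  waited ~" + measure(true, 500) + " ms");
        System.out.println("sleep outside lock: waited ~" + measure(false, 500) + " ms");
    }
}
```

In the first case the second thread waits for roughly the whole sleep; in the second it gets the monitor almost immediately. Whether moving the sleep out of the synchronized region is the right fix for Client.Connection is a separate question that this sketch does not answer.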


> Root cause of connection failure is being lost to code that uses it for delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: connectRetry.patch, hadoop-4659.patch, rpcConn.patch
>
>
> In ipc.Client, the root cause of a connection failure is being lost as the exception is wrapped,
> so the outside code that looks for that root cause isn't working as expected. The result is that
> you can't bring up a task tracker before the job tracker, and probably the same for a datanode
> before a namenode. The change that triggered this has not yet been located; I had thought it was
> HADOOP-3844, but I no longer believe this is the case.
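
As a rough illustration of the problem described in the issue text above, the following sketch shows how wrapping an exception can either keep or lose the root cause. It is not taken from ipc.Client or from any of the attached patches: the class RootCauseDemo, its method names, and the address "localhost:8020" are all hypothetical. It uses only standard java.lang.Throwable calls (initCause, getCause).

```java
import java.io.IOException;
import java.net.ConnectException;

// Hypothetical helper; not the actual ipc.Client code.
final class RootCauseDemo {
    // Adds context to a failure and keeps the original exception as the cause.
    static IOException wrapKeepingCause(String address, IOException original) {
        IOException wrapped = new IOException("Call to " + address + " failed: " + original);
        wrapped.initCause(original);
        return wrapped;
    }

    // Adds context but drops the original: only its text survives in the message.
    static IOException wrapLosingCause(String address, IOException original) {
        return new IOException("Call to " + address + " failed: " + original);
    }

    // Walks the cause chain looking for an exception of the given type.
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (type.isInstance(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        IOException original = new ConnectException("Connection refused");
        // "localhost:8020" is a made-up address used only for the message text.
        IOException kept = wrapKeepingCause("localhost:8020", original);
        IOException lost = wrapLosingCause("localhost:8020", original);
        System.out.println("cause kept: " + hasCause(kept, ConnectException.class));
        System.out.println("cause lost: " + hasCause(lost, ConnectException.class));
    }
}
```

Startup code that wants to keep retrying while the server is not yet up could use a check like hasCause to tell "connection refused" apart from other failures, but only if the wrapping layer preserves the cause.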

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

