hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4659) Root cause of connection failure is being lost to code that uses it for delaying startup
Date Tue, 25 Nov 2008 13:55:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650567#action_12650567 ]

Steve Loughran commented on HADOOP-4659:

I'm going to push out my updated lifecycle patches shortly. One test I have there brings up
a tasktracker without the rest of the infrastructure (DFS, jobtracker); it is now hanging
until the test times out, spinning while things get set up, waiting for a job tracker that
never arrives.

    [junit] Tue Nov 25 13:50:13 2008
    [junit] BEA JRockit(R) R27.4.0-90-89592-1.6.0_02-20070928-1715-linux-x86_64
    [junit] "Main Thread" id=1 idx=0x4 tid=4074 prio=5 alive, in native, sleeping, native_waiting
    [junit]     at java/lang/Thread.sleep(J)V(Native Method)
    [junit]     at org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:364)
    [junit]     at org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:310)
    [junit]     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1184b18[thin lock]
    [junit]     at org/apache/hadoop/ipc/Client$Connection.access$1800(Client.java:177)
    [junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:792)
    [junit]     at org/apache/hadoop/ipc/Client.call(Client.java:688)
    [junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:215)
    [junit]     at org/apache/hadoop/mapred/$Proxy0.getProtocolVersion(Ljava/lang/String;J)J(Unknown
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:347)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:334)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:371)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:308)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:285)
    [junit]     at org/apache/hadoop/mapred/TaskTracker.initialize(TaskTracker.java:454)
    [junit]     ^-- Holding lock: org/apache/hadoop/mapred/TaskTracker@0x34c4748[recursive]
    [junit]     at org/apache/hadoop/mapred/TaskTracker.innerStart(TaskTracker.java:830)
    [junit]     ^-- Holding lock: org/apache/hadoop/mapred/TaskTracker@0x34c4748[thin lock]
    [junit]     at org/apache/hadoop/util/Service.start(Service.java:186)
    [junit]     at org/apache/hadoop/util/Service.deploy(Service.java:654)
    [junit]     at org/apache/hadoop/mapred/TaskTracker.<init>(TaskTracker.java:965)
    [junit]     at org/apache/hadoop/mapred/TaskTracker.<init>(TaskTracker.java:948)

What I propose is to give TaskTracker a timeout on its waitForProxy() operation, so that
if the TT comes up before the JT there is a bit of leeway, but eventually the TT will
conclude that it is an orphan and that it cannot start up.
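
To make the proposal concrete, here is a minimal sketch of a bounded wait. It is not the
patch and not the Hadoop RPC API: the class BoundedProxyWait, the Connector interface and
the parameter names are hypothetical stand-ins for whatever actually opens the proxy. The
idea is only that ConnectException is retried until a deadline passes, after which the
caller gets an IOException that still carries the last connection failure as its cause.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.TimeUnit;

public class BoundedProxyWait {

    /** Hypothetical stand-in for the code that opens the RPC proxy. */
    public interface Connector<T> {
        T connect() throws IOException;
    }

    /**
     * Retries connector.connect() while it fails with ConnectException,
     * sleeping retryMillis between attempts, until timeoutMillis has elapsed.
     * Any other IOException is propagated immediately.
     */
    public static <T> T waitFor(Connector<T> connector, long timeoutMillis, long retryMillis)
            throws IOException, InterruptedException {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            try {
                return connector.connect();
            } catch (ConnectException e) {
                // The server is not up yet. Give up once the deadline has passed,
                // keeping the original failure as the cause of the new exception.
                if (System.nanoTime() - deadline >= 0) {
                    throw new IOException(
                            "No server after " + timeoutMillis + " ms; giving up", e);
                }
            }
            Thread.sleep(retryMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            // A connector that never succeeds, like a TT with no JT to talk to.
            waitFor(() -> { throw new ConnectException("Connection refused"); }, 200, 50);
        } catch (IOException e) {
            System.out.println(e.getMessage() + " (cause: " + e.getCause() + ")");
        }
    }
}
```

With something along these lines, a test that starts a tasktracker on its own would fail
after the configured leeway instead of spinning until the test runner kills it.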

> Root cause of connection failure is being lost to code that uses it for delaying startup
> ----------------------------------------------------------------------------------------
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker
>             Fix For: 0.18.3
>         Attachments: connectRetry.patch, hadoop-4659.patch, hadoop-4659.patch, rpcConn.patch,
> In ipc.Client, the root cause of a connection failure is being lost as the exception is wrapped,
> so the outside code that looks for that root cause isn't working as expected.
> The result is that you can't bring up a task tracker before the job tracker, and probably the same
> holds for a datanode before a namenode. The change that triggered this has not yet been located; I had
> thought it was HADOOP-3844, but I no longer believe this is the case.
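
The failure mode described in the issue can be illustrated with a small, self-contained
sketch. This is not the ipc.Client code: RootCauseDemo and findCause are hypothetical
names, and the exceptions are built by hand. It shows that a type check on the outer
exception misses a wrapped ConnectException, that walking the cause chain finds it, and
that the cause is unrecoverable if the wrapper only copies the message.

```java
import java.io.IOException;
import java.net.ConnectException;

public class RootCauseDemo {

    /**
     * Walks the cause chain of t and returns the first throwable of the
     * requested type, or null if none is found. The depth limit guards
     * against a cyclic cause chain.
     */
    public static <T extends Throwable> T findCause(Throwable t, Class<T> type) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 32; c = c.getCause(), depth++) {
            if (type.isInstance(c)) {
                return type.cast(c);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ConnectException root = new ConnectException("Connection refused");

        // Wrapped with the cause preserved: the outer type differs,
        // but the root cause is still reachable through getCause().
        IOException wrapped = new IOException("Call failed", root);
        System.out.println("outer is ConnectException: " + (wrapped instanceof ConnectException));
        System.out.println("found in chain: " + (findCause(wrapped, ConnectException.class) != null));

        // Wrapped by copying only the message: the root cause is gone.
        IOException lost = new IOException("Call failed: " + root.getMessage());
        System.out.println("found after lossy wrap: " + (findCause(lost, ConnectException.class) != null));
    }
}
```

Code that delays startup by catching ConnectException directly behaves like the first
check: once the exception is wrapped, it no longer matches and the retry logic is skipped.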

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
