hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4659) Root cause of connection failure is being lost to code that uses it for delaying startup
Date Tue, 18 Nov 2008 14:03:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648600#action_12648600 ]

Steve Loughran commented on HADOOP-4659:
----------------------------------------

The two patches here do slightly different things and need to be merged:

- Mine left the class of an exception alone and, for ConnectExceptions, inserted the host
  and port into the text, as with HADOOP-3844.
- Hairong's stopped some exceptions from being swallowed during setupIOstreams.

I think both are needed, so I will apply Hairong's to my code and generate a combined patch.
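
As a rough illustration of the first behaviour (keep the exception class, add the host
and port to the text), here is a minimal sketch. It is not the code from either attached
patch: the class name ConnectFailures, the method withTarget and the message wording are
all assumptions made for this example.

```java
import java.io.IOException;
import java.net.ConnectException;

// Hypothetical helper for illustration only; not taken from connectRetry.patch
// or hadoop-4659.patch.
final class ConnectFailures {
    private ConnectFailures() {}

    /**
     * For a ConnectException, returns a new ConnectException whose message names
     * the target host and port and whose cause is the original exception.
     * Any other IOException is returned unchanged, so its class is left alone.
     */
    static IOException withTarget(String host, int port, IOException e) {
        if (e instanceof ConnectException) {
            ConnectException wrapped = new ConnectException(
                "Call to " + host + ":" + port
                    + " failed on connection exception: " + e);
            wrapped.initCause(e); // keep the original failure reachable via getCause()
            return wrapped;
        }
        return e;
    }
}
```

Because the returned object is still a ConnectException, a caller that catches or tests
for that class keeps working, and the log message now says which endpoint refused the
connection.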


I can test this by deploying an orphan task tracker, but in that situation, once the code
is fixed, the TaskTracker will spin forever. If a timeout on the retries could be provided,
we could add a test that verified the tracker ran for 20-30s before timing out and relaying
the exception. In production you'd set the timeout to a number of hours or forever, obviously.
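
A retry loop with such a timeout might look roughly like the sketch below. The names
BoundedRetry and retryUntil, and the choice of parameters, are hypothetical; nothing here
is an existing Hadoop API or configuration option. A test could pass a timeout of a few
seconds, while production could pass many hours or Long.MAX_VALUE for "forever".

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a bounded retry loop; not part of Hadoop.
final class BoundedRetry {
    private BoundedRetry() {}

    /**
     * Calls attempt until it succeeds. A ConnectException triggers a sleep and
     * another attempt until timeoutMillis has elapsed, after which the last
     * ConnectException is rethrown. Any other exception propagates immediately.
     */
    static <T> T retryUntil(Callable<T> attempt, long timeoutMillis, long sleepMillis)
            throws Exception {
        // toNanos saturates at Long.MAX_VALUE, so a huge timeout means "forever".
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        long start = System.nanoTime();
        while (true) {
            try {
                return attempt.call();
            } catch (ConnectException e) {
                if (System.nanoTime() - start >= timeoutNanos) {
                    throw e; // relay the root cause instead of spinning forever
                }
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```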

> Root cause of connection failure is being lost to code that uses it for delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: connectRetry.patch, hadoop-4659.patch
>
>
> In ipc.Client, the root cause of a connection failure is being lost as the exception is wrapped,
> so the outside code that looks for that root cause isn't working as expected. The result is
> that you can't bring up a task tracker before the job tracker, and probably the same for a
> datanode before a namenode. The change that triggered this has not yet been located; I had
> thought it was HADOOP-3844, but I no longer believe this is the case.
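
To make the failure mode concrete, here is a small sketch of how calling code might look
for the root cause, and how wrapping can hide it. The class RootCause and its methods are
hypothetical and are not the actual Hadoop startup code.

```java
import java.net.ConnectException;

// Hypothetical illustration of root-cause lookup; not the Hadoop implementation.
final class RootCause {
    private RootCause() {}

    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** True if the innermost exception is a ConnectException. */
    static boolean isConnectFailure(Throwable t) {
        return rootCause(t) instanceof ConnectException;
    }
}
```

With this check, new IOException("call failed", connectException) is still recognised as
a connection failure because the cause is attached. An exception built only from the
text, such as new IOException(connectException.toString()), is not recognised: the class
and cause are gone, so code that delays startup until the server is reachable can no
longer tell that it should retry.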

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

