hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3881) IPC client doesnt time out if far end handler hangs
Date Fri, 01 Aug 2008 07:56:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618956#action_12618956 ]

Steve Loughran commented on HADOOP-3881:

Yes, retries in this situation would not be ideal. Throwing an exception such as "timeout invoking
InterTracker.heartbeat() on / -possible deadlock" would be enough for developers.
In production, well, it shouldn't show up. Shall I close this issue as INVALID?
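
As a rough illustration of the "throw a timeout exception" idea, here is a minimal sketch that bounds a blocking call with a Future. TimedCall, invoke() and their parameters are hypothetical names invented for this example; this is not how Hadoop's IPC client is implemented, just one way a caller could get an exception instead of hanging.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class TimedCall {
    private TimedCall() {}

    /**
     * Runs the call on a daemon worker thread and waits at most timeoutMillis.
     * The operation and target strings are only used to build the error message.
     */
    static <T> T invoke(String operation, String target, long timeoutMillis, Callable<T> call)
            throws IOException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true); // a hung call must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the far end may still be stuck
            throw new IOException("timeout invoking " + operation + " on " + target
                    + " - possible deadlock", e);
        } catch (ExecutionException e) {
            throw new IOException("failure invoking " + operation + " on " + target, e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            throw new IOException("interrupted invoking " + operation + " on " + target, e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap the blocking operation, for example TimedCall.invoke("heartbeat()", host, 10000, () -> tracker.heartbeat()), where host and tracker are whatever the caller already has. Creating an executor per call is wasteful; a real client would more likely share one or use a socket-level timeout.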

If retry load is an issue then the whole client retry logic in TaskTracker and DataNode
needs to be looked at. There's a sleep, with the sleep time hard-coded in the source. That
means that if the whole datacentre is synchronized, as you get if the power gets toggled and
they all boot up at the same time, there's a risk that all the nodes in the datacentre will
hit the tracker/namenode simultaneously. Even exponential backoff doesn't work if the clocks
are fully synchronized. It helps, but a bit of jitter is needed too, just to round things off.
There's enough complexity/duplication here that this could be pushed into a reusable class.
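
To make the backoff-plus-jitter point concrete, here is a minimal sketch of the kind of reusable class this could become. JitteredBackoff and its parameters are hypothetical names, not existing Hadoop code, and the policy (half of the exponential delay fixed, half random) is just one common choice among several.

```java
import java.util.Random;

/** Hypothetical retry-delay policy for illustration only; not part of Hadoop. */
final class JitteredBackoff {
    private final long baseMillis;
    private final long maxMillis;
    private final Random random;

    JitteredBackoff(long baseMillis, long maxMillis, Random random) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
        this.random = random;
    }

    /** Exponential ceiling for a 0-based attempt number, capped at maxMillis. */
    long ceilingMillis(int attempt) {
        long ceiling = baseMillis;
        for (int i = 0; i < attempt && ceiling < maxMillis; i++) {
            // Double without overflowing, then clamp to the cap.
            ceiling = (ceiling <= maxMillis / 2) ? ceiling * 2 : maxMillis;
        }
        return ceiling;
    }

    /**
     * Delay before the next retry: half the ceiling is fixed, the other half is
     * random, so nodes that failed at the same instant spread out their retries.
     */
    long delayMillis(int attempt) {
        long ceiling = ceilingMillis(attempt);
        long fixed = ceiling / 2;
        long jitter = (long) (random.nextDouble() * (ceiling - fixed));
        return fixed + jitter;
    }
}
```

Each node would construct its own instance with its own Random, so that two nodes booting at the same moment still pick different delays; seeding every node identically would defeat the purpose.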

Also, maybe the IPC and design decisions could be documented in the wiki.

> IPC client doesnt time out if far end handler hangs
> ---------------------------------------------------
>                 Key: HADOOP-3881
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3881
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>            Reporter: Steve Loughran
>            Priority: Minor
> This is what appears to be happening in some changes of mine that (inadvertently) blocked
> JobTracker: if the client can connect to the far end and invoke an operation, the far end
> has forever to deal with the request, and the client blocks too.
> Clearly the far end shouldn't do this; it's a serious problem to address. But should the
> client hang? Should it not time out after some specifiable time and signal that the far end
> isn't processing requests in a timely manner?
> (Marked as minor as this shouldn't arise in day-to-day operation. But it should be easy
> to create a mock object to simulate this, and timeouts are considered useful in an IPC.)
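
The mock mentioned in the description could be as small as a server socket that accepts the connection and then never replies. The sketch below is an assumption-laden illustration using plain java.net sockets rather than Hadoop's IPC classes; HungHandlerDemo and readTimesOut() are hypothetical names. It shows that a read timeout set with Socket.setSoTimeout() turns the hang into a SocketTimeoutException.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo for illustration only; it does not use Hadoop's IPC classes. */
final class HungHandlerDemo {
    private HungHandlerDemo() {}

    /**
     * Connects to a local mock "far end" that accepts the connection but never
     * answers. Returns true if the client's read timed out instead of blocking.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket hungHandler = server.accept()) { // accepted, but never written to
            client.setSoTimeout(timeoutMillis);      // bound every blocking read
            client.getOutputStream().write(1);       // pretend to send a request
            client.getOutputStream().flush();
            try {
                client.getInputStream().read();      // would block forever without the timeout
                return false;                        // the far end answered or closed
            } catch (SocketTimeoutException e) {
                return true;                         // the hang was detected
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

Whether a plain read timeout is the right fix for an RPC layer is a separate question, since a slow but healthy handler would trip it too; the sketch only shows that the hang is easy to reproduce and detect.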

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
