hadoop-common-issues mailing list archives

From "Wilfred Spiegelenburg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
Date Wed, 03 Dec 2014 17:22:14 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233209#comment-14233209 ]

Wilfred Spiegelenburg commented on HADOOP-11252:
------------------------------------------------

[~andrew.wang] you are most likely correct, due to the way the rpc timeout in the client code overwrites the ping timeout. I'll have to step through the client code to make sure it behaves as intended. The ping is generated after a {{SocketTimeoutException}} is thrown on the input stream, which is triggered by the {{setSoTimeout(pingInterval)}} call on the socket; combined with the overwrite, that could be a problem. This might require a further decoupling of the ping and rpc timeouts.
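Roughly, the read/ping interaction looks like the sketch below. This is not the actual {{org.apache.hadoop.ipc.Client}} code; {{PingingReader}}, {{sendPing()}} and the field names are made up for illustration:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative only: a reader that turns read timeouts into pings, the way
// the client's ping handling is described above.
class PingingReader {
  private final InputStream in;
  private final int pingInterval; // ms, drives setSoTimeout
  private final int rpcTimeout;   // ms, 0 = no rpc timeout
  private long waited = 0;        // ms spent waiting without receiving data

  PingingReader(Socket socket, int pingInterval, int rpcTimeout) throws IOException {
    this.in = socket.getInputStream();
    this.pingInterval = pingInterval;
    this.rpcTimeout = rpcTimeout;
    socket.setSoTimeout(pingInterval); // read() throws after pingInterval ms
  }

  int read() throws IOException {
    while (true) {
      try {
        int b = in.read(); // blocks for at most pingInterval ms
        waited = 0;        // got data, reset the clock
        return b;
      } catch (SocketTimeoutException e) {
        waited += pingInterval;
        if (rpcTimeout > 0 && waited >= rpcTimeout) {
          throw e;         // rpc timeout exceeded: fail the call
        }
        sendPing();        // keep the connection alive and retry the read
      }
    }
  }

  private void sendPing() throws IOException {
    // placeholder: the real client writes a ping request on the output stream
  }
}
{code}

This is where the overwrite matters: if the rpc timeout replaces the value passed to {{setSoTimeout()}}, the ping never gets a chance to fire.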

I also noticed that the ping output stream is created with a fixed timeout of 0, which means we can still hang there even after the changes.
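The difference boils down to which timeout is handed to the stream factory. A minimal sketch, assuming {{NetUtils.getOutputStream}} treats 0 as "block forever" (the method exists in {{org.apache.hadoop.net.NetUtils}}; the wrapper below is made up):

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import org.apache.hadoop.net.NetUtils;

class PingStreamExample {
  // Illustrative wrapper: a write timeout of 0 lets a write block forever
  // once the local send buffer is full; a positive value caps the wait.
  static OutputStream openPingStream(Socket socket, int writeTimeoutMs)
      throws IOException {
    // current behaviour: NetUtils.getOutputStream(socket, 0) -> can hang
    // proposed: pass the configured write timeout through instead of 0
    return NetUtils.getOutputStream(socket, writeTimeoutMs);
  }
}
{code}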

I also looked at the HDFS code to see how it is handled there: all references to the timeout we are setting call it a "socket write timeout". I am happy to call it something else, but this name seems to be in line with HDFS. As far as I am aware, {{SO_SNDTIMEO}} only comes into play when the send buffers at the OS level on the local machine are full. If the buffer is not full when the data is written, that timeout will never trigger and we fall straight through to the tcp retries. That case should be handled by the timeout we are setting.
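A rough way to see that behaviour is the sketch below, assuming a made-up peer that accepts the connection but never reads; writes succeed instantly until the kernel send buffer fills, and only then can the write timeout fire:

{code:java}
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import org.apache.hadoop.net.NetUtils;

class SendBufferDemo {
  public static void main(String[] args) throws Exception {
    Socket socket = new Socket();
    socket.connect(new InetSocketAddress("some-host", 12345)); // made-up peer
    // 10s write timeout; it is only enforced once the OS send buffer fills.
    OutputStream out = NetUtils.getOutputStream(socket, 10000);
    byte[] chunk = new byte[64 * 1024];
    long written = 0;
    while (true) {
      out.write(chunk);        // returns immediately while the buffer has room
      written += chunk.length; // once the buffer is full this write blocks,
      // and after ~10s of no progress the stream should throw, instead of
      // sitting in the 15-30 minute tcp retry window.
      System.out.println("buffered " + written + " bytes");
    }
  }
}
{code}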

The default change was only a proposal; setting it to 0 is the right choice for backwards compatibility.
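To make the trade-off concrete, the two options look roughly like the sketch below. The key name comes from this issue; the default value, helper and flag are assumptions for illustration:

{code:java}
import org.apache.hadoop.conf.Configuration;

class RpcTimeoutDefaults {
  static final String PING_INTERVAL_KEY = "ipc.ping.interval";
  static final int PING_INTERVAL_DEFAULT = 60000; // ms, assumed default

  // Illustrative helper: what the fallback would look like under each option.
  static int resolveRpcTimeout(Configuration conf, int passedIn, boolean legacyDefault) {
    if (passedIn > 0) {
      return passedIn; // caller supplied an explicit timeout
    }
    // legacy default: 0, i.e. no write timeout (backwards compatible)
    // proposed default: the configured ping interval
    return legacyDefault ? 0 : conf.getInt(PING_INTERVAL_KEY, PING_INTERVAL_DEFAULT);
  }
}
{code}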

> RPC client write does not time out by default
> ---------------------------------------------
>
>                 Key: HADOOP-11252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11252
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.5.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>         Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout of 0 when no timeout is passed in. This means
that the network connection created will not time out when used to write data. The issue has
shown up in YARN-2578 and HDFS-4858. Timeouts for writes then fall back to the tcp level retries
(configured via tcp_retries2), which can take between 15 and 30 minutes. That is too long for
a default behaviour.
> Using 0 as the default value for the timeout is incorrect. We should use a sane value for
the timeout, and the "ipc.ping.interval" configuration value is a logical choice for it. The
default behaviour should be changed from 0 to the ping interval value read from the
Configuration.
> Fixing it in common makes more sense than finding and changing all other points in the
code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
