hadoop-common-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
Date Mon, 01 Dec 2014 22:44:13 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230601#comment-14230601 ]

Andrew Wang commented on HADOOP-11252:
--------------------------------------

Hi Wilfred, thanks for working on this.

I want to start by making sure I understand the patch correctly. We're changing the default
RPC timeout to 5 minutes rather than 0. This means that, rather than sending a ping after a
read blocks for 60s, we throw an exception after a read blocks for 5 minutes. This does not
actually involve write timeouts in the SO_SNDTIMEO sense, so it seems misleading to call it a
"write timeout". If we get blocked on the socket write, we will still be stuck until the TCP
stack gives up (the tcp_retries2 you've mentioned elsewhere).
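
To illustrate the read-versus-write distinction, here is a minimal sketch using plain
java.net.Socket, not the Hadoop ipc Client code. The class and method names are made up for
illustration. It shows that setSoTimeout bounds a blocking read; java.net.Socket offers no
equivalent setter for blocking writes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of Hadoop.
class ReadTimeoutSketch {

    /**
     * Connects to a local server that accepts the connection but never
     * replies, then attempts a blocking read with SO_TIMEOUT set.
     * Returns true if the read was cut short by SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // SO_TIMEOUT applies to reads on this socket only. A write that
            // blocks on a full send buffer is not bounded by this setting.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // the peer never writes, so this blocks
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```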

As [~daryn] points out above, and also on HDFS-4858 by [~atm], we've historically been reluctant
to change defaults like this because of potential side effects. I'm not comfortable changing
the defaults here either without sign-off from someone like [~daryn], who knows the RPC code better.

So, a few review comments:

* Let's rename the config param as Ming recommends above, seems more accurate. Including Ming's
unit test would also be great.
* Let's keep the default value of this at 0 to preserve current behavior, unless [~daryn]
OKs the change.
* Since getPingInterval is now package-protected, we should also change setPingInterval to
package-protected for parity. It's only used in a test.
* We also need to add the new config key to core-default.xml, with a description.
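
For reference, the behavior I'm describing can be sketched as below. This is only my reading
of the patch, modeled with java.util.Properties instead of the Hadoop Configuration class so
that it stands alone. "ipc.ping.interval" is the key from the issue description; the timeout
key name, the class, and the helper methods are hypothetical placeholders.

```java
import java.util.Properties;

// Hypothetical sketch; not the actual Client code.
class RpcTimeoutSketch {
    // Key named in the issue description.
    static final String PING_INTERVAL_KEY = "ipc.ping.interval";
    // Placeholder key name: the final name is still under discussion.
    static final String RPC_TIMEOUT_KEY = "example.rpc.timeout.ms";

    static final int PING_INTERVAL_DEFAULT_MS = 60_000; // the 60s mentioned above
    static final int RPC_TIMEOUT_DEFAULT_MS = 0;        // 0 preserves current behavior

    static int getInt(Properties conf, String key, int defaultValue) {
        String value = conf.getProperty(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    /** True when a blocked read leads to a ping rather than an exception. */
    static boolean pingsOnReadTimeout(Properties conf) {
        return getInt(conf, RPC_TIMEOUT_KEY, RPC_TIMEOUT_DEFAULT_MS) <= 0;
    }

    /** How long a read may block before the client pings or gives up. */
    static int readTimeoutMs(Properties conf) {
        int rpcTimeout = getInt(conf, RPC_TIMEOUT_KEY, RPC_TIMEOUT_DEFAULT_MS);
        return rpcTimeout > 0
                ? rpcTimeout
                : getInt(conf, PING_INTERVAL_KEY, PING_INTERVAL_DEFAULT_MS);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(pingsOnReadTimeout(conf) + " " + readTimeoutMs(conf));
        conf.setProperty(RPC_TIMEOUT_KEY, "300000"); // the 5 minutes proposed in the patch
        System.out.println(pingsOnReadTimeout(conf) + " " + readTimeoutMs(conf));
    }
}
```

With the default left at 0, existing deployments keep pinging; only users who set a positive
value opt in to the exception.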

> RPC client write does not time out by default
> ---------------------------------------------
>
>                 Key: HADOOP-11252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11252
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.5.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>         Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. This means
that the network connection created will not time out when used to write data. The issue has
shown up in YARN-2578 and HDFS-4858. Timeouts for writes then fall back to the TCP-level retry
(configured via tcp_retries2), which times out after 15-30 minutes. That is too long for
a default behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane value for
the timeout and the "ipc.ping.interval" configuration value is a logical choice for it. The
default behaviour should be changed from 0 to the value read for the ping interval from the
Configuration.
> Fixing it in common makes more sense than finding and changing all other points in the
code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
