hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10127) Add ipc.client.connect.retry.interval to control the frequency of connection retries
Date Wed, 27 Nov 2013 11:01:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833654#comment-13833654 ]

Steve Loughran commented on HADOOP-10127:
-----------------------------------------

Which clients are you thinking of here? 
What we need to avoid is overload during failover or restart operations on very large-scale
clusters, where the scenarios are:
# a master service fails, failover begins and all the worker nodes in the cluster generate
large numbers of connect requests to the successor. 
# a cluster power restart event, where all started nodes/clients start hitting the booting services
in near-perfect sync. I've seen this with flash-based devices where the boot time is constant
for all nodes - it's why jitter is important, and why clock-based & time-since-boot jitter
needs an extra bit of randomness.
# a server taken offline explicitly while heavy client load is coming in from outside. Here, the
more clients that block retrying connection requests, the more pending calls build up, so the server
ends up receiving a massive multiple of the normal working load the moment it goes live.
# more than one of the above problems. This is what led to the infamous Facebook HDFS cascade
failure - and hence why NN heartbeats now come in on a different RPC port from DFS client operations.

Shrink the retry time and the load generated against starting/failing over endpoints can increase
massively. That doesn't mean it shouldn't be allowed - just that you need to understand that
special problems arise at a few thousand servers and plan for them.

If it really is NM->RM calls you are worried about, then perhaps rather than make changes
to the general IPC client, this is a good time to impose a better retry policy here, where
exponential backoff with jitter is what I'd propose. The initial delay could be small, but
it would back off fast if the cluster were down for any length of time.
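
As a rough illustration of what exponential backoff with jitter could look like, here is a minimal
Java sketch. The class name JitteredBackoff, its methods, and the 100 ms / 10 s values are
hypothetical and are not part of Hadoop's IPC client or its existing retry policies. The sketch
assumes a delay made of a fixed half plus a random half of an exponentially growing, capped ceiling;
other jitter schemes would work too.

```java
import java.util.Random;

/**
 * Hypothetical helper (not a Hadoop class): exponential backoff with jitter.
 * The ceiling doubles on every failed attempt up to a cap; the actual delay
 * is half of the ceiling plus a random share of the other half, so clients
 * that failed at the same instant do not retry at the same instant.
 */
final class JitteredBackoff {
    private final long baseDelayMs;
    private final long maxDelayMs;
    private final Random random;

    JitteredBackoff(long baseDelayMs, long maxDelayMs, Random random) {
        if (baseDelayMs <= 0 || maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("need 0 < baseDelayMs <= maxDelayMs");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.random = random;
    }

    /** Upper bound for a 0-based attempt number: base * 2^attempt, capped at maxDelayMs. */
    long ceilingMs(int attempt) {
        long ceiling = baseDelayMs;
        for (int i = 0; i < attempt && ceiling < maxDelayMs; i++) {
            // Compare before doubling so the multiplication cannot overflow.
            ceiling = (ceiling > maxDelayMs / 2) ? maxDelayMs : ceiling * 2;
        }
        return ceiling;
    }

    /** Delay before the next retry: a value in [ceiling/2, ceiling). */
    long nextDelayMs(int attempt) {
        long ceiling = ceilingMs(attempt);
        long fixedPart = ceiling / 2;
        long randomPart = (long) (random.nextDouble() * (ceiling - fixedPart));
        return fixedPart + randomPart;
    }
}
```

A caller would sleep for nextDelayMs(attempt) between connection attempts and increment attempt
after each failure. With a 100 ms base and a 10 s cap, the first retries come quickly, but the
ceiling reaches the cap after seven consecutive failures. Each node should seed its Random
independently, for the reason given above about clock-based and time-since-boot values.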

> Add ipc.client.connect.retry.interval to control the frequency of connection retries
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-10127
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10127
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.2.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>         Attachments: hadoop-10127-1.patch
>
>
> Currently, {{ipc.Client}} client attempts to connect to the server every 1 second. It
> would be nice to make this configurable to be able to connect more/less frequently. Changing
> the number of retries alone is not granular enough.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
