hadoop-hdfs-issues mailing list archives

From "Chen Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-14652) HealthMonitor connection retry times should be configurable
Date Tue, 06 Aug 2019 03:44:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900586#comment-16900586 ]

Chen Zhang commented on HDFS-14652:

Uploaded patch v3.

Hi [~jojochuang], I uploaded the full patch. If you only need the incremental part, I'll
resubmit a new one. Thanks.

> HealthMonitor connection retry times should be configurable
> -----------------------------------------------------------
>                 Key: HDFS-14652
>                 URL: https://issues.apache.org/jira/browse/HDFS-14652
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>             Fix For: 3.3.0
>         Attachments: HDFS-14652-001.patch, HDFS-14652-002.patch, HDFS-14652.003.patch
> On our production HDFS cluster, burst requests from some clients filled the kernel TCP queue
> on the NameNode's host. Because "net.ipv4.tcp_syn_retries" is set to 1 in our environment,
> the ZKFC HealthMonitor got a connection error after 3 seconds, like this:
> {code:java}
> WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at nn_host_name/ip_address:port: Call From zkfc_host_name/ip to nn_host_name:port failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> {code}
> This error triggered a failover and affected the availability of that cluster. We fixed the issue
> by increasing the kernel parameter net.ipv4.tcp_syn_retries to 6.
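
The 3-second figure above, and the effect of raising the setting to 6, can be checked with a little arithmetic. This is a minimal sketch assuming an initial SYN retransmission timeout of 1 second that doubles after every attempt (the usual behavior on recent Linux kernels). The class and method names are made up for illustration.

```java
class SynRetryTimeout {
    // Assumption: the first SYN waits 1 second, and each retransmission doubles the wait.
    static final long INITIAL_RTO_SECONDS = 1;

    /** Total seconds until connect() gives up, for a given net.ipv4.tcp_syn_retries value. */
    static long totalTimeoutSeconds(int synRetries) {
        long total = 0;
        long rto = INITIAL_RTO_SECONDS;
        // One wait after the initial SYN, plus one wait after each retransmitted SYN.
        for (int attempt = 0; attempt <= synRetries; attempt++) {
            total += rto;
            rto *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("tcp_syn_retries=1 -> " + totalTimeoutSeconds(1) + " s");
        System.out.println("tcp_syn_retries=6 -> " + totalTimeoutSeconds(6) + " s");
    }
}
```

Under these assumptions a value of 1 gives 1 + 2 = 3 seconds, matching the report, while a value of 6 gives 127 seconds before the connection attempt times out.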
> While working on this issue, we found that the HealthMonitor's connection retry count (ipc.client.connect.max.retries)
> is hard-coded to 1. I think it should be configurable, so that if we don't want the HealthMonitor
> to be so sensitive, we can change its behavior through this configuration.
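
The proposal above amounts to reading the retry count from configuration instead of fixing it in code. The following is only a rough sketch of that idea, not the contents of the attached patch. It uses `java.util.Properties` as a stand-in for Hadoop's configuration, a hypothetical key name, and a generic retry loop rather than Hadoop's IPC client.

```java
import java.net.ConnectException;
import java.util.Properties;
import java.util.concurrent.Callable;

class ConfigurableRetry {
    // Hypothetical key name, for illustration only; not taken from the patch.
    static final String MAX_RETRIES_KEY = "example.health-monitor.connect.max.retries";
    // The default mirrors the hard-coded value of 1 described in the issue.
    static final int MAX_RETRIES_DEFAULT = 1;

    /** Reads the retry count, falling back to the default when the key is unset. */
    static int maxRetries(Properties conf) {
        return Integer.parseInt(
            conf.getProperty(MAX_RETRIES_KEY, String.valueOf(MAX_RETRIES_DEFAULT)));
    }

    /** Runs the probe once, then retries up to maxRetries more times on failure. */
    static <T> T callWithRetries(Callable<T> probe, int maxRetries) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return probe.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        conf.setProperty(MAX_RETRIES_KEY, "3"); // a less sensitive monitor

        int[] calls = {0};
        // Simulated health probe: fails twice with a connection error, then succeeds.
        String result = callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("Connection timed out");
            }
            return "healthy";
        }, maxRetries(conf));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With the default of 1, the same simulated probe would fail on both attempts and the exception would propagate, which corresponds to the failover-triggering behavior described in the report.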

This message was sent by Atlassian JIRA

