hadoop-common-issues mailing list archives

From "Xiao Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15321) Reduce the RPC Client max retries on timeouts
Date Thu, 22 Mar 2018 18:15:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410025#comment-16410025

Xiao Chen commented on HADOOP-15321:

Thanks a lot for the history and for tracing this back to SVN [~kihwal]! A great lecture. :) You're
also right that the connection timeout default was 20 seconds.

There is also more to share about the failure(s) I was seeing, and it's actually a mix
of things. Apologies, I was confused initially and didn't distinguish between the 2 timeouts. The
specific error I saw was from Impala, but it's really just calling through JNI to dfsclients.

1. There is the 60-second timeout for the actual read, when setting up the TCP connection to the
DN. This is okay because the DN will be added to dead nodes and the next try will hit another
DN, which would succeed.
W0125 23:37:35.947903 22700 DFSInputStream.java:696] Failed to connect to /DN:20003 for block,
add to deadNodes and continue. org.apache.hadoop.net.ConnectTimeoutException: 60000 millis
timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
Java exception follows:
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel
to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/DN0:20003]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
        at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3530)
        at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:840)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:755)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:658)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:895)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:972)
        at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:147)
I0125 23:37:35.953243 22700 DFSInputStream.java:678] Successfully connected to /DN:20003 for
The version we saw did not have HDFS-11993, but judging from the event times and log patterns,
I think this must be the case.

2. There are also the 45 retries, for which we do not have stack traces.
I0125 23:50:06.012015 22689 Client.java:870] Retrying connect to server: DATANODE:50020. Already
tried 44 time(s); maxRetries=45
These retries are 20 seconds apart, 45 in a row. No stack trace or other interesting
information was logged because debug wasn't turned on.
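
As a rough illustration of the arithmetic behind these numbers, here is a minimal sketch. It is not Hadoop code: the class and method names are hypothetical, and it assumes each attempt simply blocks for the full timeout, ignoring any pause between attempts and any off-by-one in how attempts are counted.

```java
import java.time.Duration;

// Hypothetical helper, not part of Hadoop: estimates how long a client
// can stay blocked when every connect attempt times out.
class RetryBudgetSketch {

    // Assumption: total wait is roughly attempts * per-attempt timeout.
    static Duration worstCase(int attempts, Duration perAttemptTimeout) {
        return perAttemptTimeout.multipliedBy(attempts);
    }

    public static void main(String[] args) {
        // 45 retries, 20 seconds apart, as in the log above.
        System.out.println(worstCase(45, Duration.ofSeconds(20)).toMinutes()); // prints 15
        // 45 retries with a 60-second timeout, as in the issue description.
        System.out.println(worstCase(45, Duration.ofSeconds(60)).toMinutes()); // prints 45
    }
}
```

Either way, a single unreachable DN can hold the caller for many minutes before the client gives up.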

Regarding the fix, your advice makes sense to me. To make sure my understanding is correct:
we can configure the client -> DN IPC to not retry, and instead do our own retries, similar to [the
existing way of adding a DN to deadnodes and retry on the next DN|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L598].
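
A minimal sketch of that pattern, assuming each DN gets exactly one non-retrying connect attempt. The class, interface, and method names here are hypothetical; this is not the DFSInputStream code, only an illustration of "mark the DN dead and move on to the next one".

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the Hadoop implementation.
class DeadNodeRetrySketch {

    // Hypothetical stand-in for one connection attempt with no internal retries.
    interface Connector {
        void connect(String node) throws IOException;
    }

    private final Set<String> deadNodes = new HashSet<>();

    // Tries each candidate once, skipping nodes already marked dead.
    // Returns the first node that accepts the connection.
    String connectToAny(List<String> candidates, Connector connector) throws IOException {
        IOException last = null;
        for (String node : candidates) {
            if (deadNodes.contains(node)) {
                continue; // failed earlier, do not wait on it again
            }
            try {
                connector.connect(node);
                return node;
            } catch (IOException e) {
                deadNodes.add(node); // remember the failure and try the next node
                last = e;
            }
        }
        throw new IOException("No reachable node among " + candidates, last);
    }

    Set<String> deadNodes() {
        return deadNodes;
    }
}
```

With this shape, the cost of an unreachable DN is bounded by a single timeout rather than by the IPC-level retry count.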

If there are no objections, I can give it a shot soon...

> Reduce the RPC Client max retries on timeouts
> ---------------------------------------------
>                 Key: HADOOP-15321
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15321
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>            Priority: Major
> Currently, the [default|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeysPublic.java#L379] number
> of retries when the IPC client catches a {{ConnectTimeoutException}} is 45. This seems unreasonably
> high. Given the IPC client timeout is by default 60 seconds, if a DN host is shut down the
> client will retry for 45 minutes until aborting. (If the host is there but the process is down, it would
> throw a connection refused immediately, which is cool)
> Creating this Jira to discuss whether we can reduce that to a reasonable number.
