hadoop-hdfs-issues mailing list archives

From Íñigo Goiri (JIRA) <j...@apache.org>
Subject [jira] [Commented] (HDFS-13119) RBF: Manage unavailable clusters
Date Wed, 07 Feb 2018 23:07:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356211#comment-16356211 ]

Íñigo Goiri commented on HDFS-13119:

We ran into this the other day when we added a subcluster for testing and the Namenodes
in this subcluster were down for a few days. The Routers ended up with thousands of threads
trying to open RPC connections to the Namenodes that were down. One example was {{renewLease()}}:
this operation is executed in all the subclusters, and we had connections stuck for more
than 3 minutes because the default retry policy was to try 10 times with a timeout of 20 seconds
(up to 200 seconds in total).

We should do a couple of things:
* Better control of the number of RPC clients
* No need to try so many times if we "know" the subcluster is down
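
A minimal sketch of how both points could fit together, assuming a small helper that sits in front of the RPC calls. The class {{SubclusterGuard}} and all of its methods are hypothetical names for illustration; they are not part of {{RouterRpcClient}} or any existing Hadoop API. It caps the number of concurrent calls per subcluster and returns a smaller retry count once a subcluster has been marked as down:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical helper for illustration only; not an existing Hadoop class. */
final class SubclusterGuard {
  private final int maxConcurrent;  // cap on in-flight calls per subcluster
  private final int normalRetries;  // retries while the subcluster is believed up
  private final int downRetries;    // retries once the subcluster is marked down

  private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();
  private final Set<String> knownDown = ConcurrentHashMap.newKeySet();

  SubclusterGuard(int maxConcurrent, int normalRetries, int downRetries) {
    this.maxConcurrent = maxConcurrent;
    this.normalRetries = normalRetries;
    this.downRetries = downRetries;
  }

  /** Record that a subcluster is unavailable (how this is detected is out of scope). */
  void markDown(String nsId) { knownDown.add(nsId); }

  /** Record that a subcluster is reachable again. */
  void markUp(String nsId) { knownDown.remove(nsId); }

  /** Fewer attempts when we already "know" the subcluster is down. */
  int retriesFor(String nsId) {
    return knownDown.contains(nsId) ? downRetries : normalRetries;
  }

  /** Non-blocking: returns false instead of queueing another stuck thread. */
  boolean tryAcquire(String nsId) {
    return permits
        .computeIfAbsent(nsId, k -> new Semaphore(maxConcurrent))
        .tryAcquire();
  }

  /** Must be called once for every successful tryAcquire(). */
  void release(String nsId) {
    Semaphore s = permits.get(nsId);
    if (s != null) {
      s.release();
    }
  }
}
```

A caller would check {{tryAcquire(nsId)}} before invoking a subcluster, fail fast if it returns false, and call {{release(nsId)}} in a {{finally}} block. The limits (for example 1 retry for a subcluster that is down) are placeholders, not proposed defaults.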

> RBF: Manage unavailable clusters
> --------------------------------
>                 Key: HDFS-13119
>                 URL: https://issues.apache.org/jira/browse/HDFS-13119
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Íñigo Goiri
>            Priority: Major
> When a federated cluster has one of its subclusters down, operations that run in every
subcluster ({{RouterRpcClient#invokeAll()}}) may take all the RPC connections.
