ambari-dev mailing list archives

From "Alejandro Fernandez" <>
Subject Review Request 36231: Revert, The Default hdfs-site.xml Should Have Client Retry Logic Enabled For Rolling Upgrade
Date Mon, 06 Jul 2015 23:58:46 GMT

Review request for Ambari, Jonathan Hurley and Nate Cole.

Repository: ambari


In an HA cluster where the former primary NN was killed "dirty" (by a catastrophic power-down
or equivalent) and the cluster has successfully failed over to the other NN, a client that
first attempts to contact the dead NN takes 10 minutes to switch to the other NN.

In Ambari 2.0 and HDP 2.2, dfs.client.retry.policy.enabled was not set at all.
In Ambari 2.1 for HDP 2.3, it was recently defaulted to true as part of AMBARI-11192.
However, this causes problems during Rolling Upgrade (RU).

In an HA setup, retries should be handled by RetryInvocationHandler using the retry policy
FailoverOnNetworkExceptionRetry. The client first translates the nameservice ID into two host
names and creates an individual RPC proxy for each NameNode. Each individual NameNode proxy
still uses MultipleLinearRandomRetry as its local retry policy, but because we usually set
dfs.client.retry.policy.enabled to false, this internal retry is effectively disabled. If we
then hit any connection issue or remote exception (including StandbyException), the exception
is caught by RetryInvocationHandler and handled according to FailoverOnNetworkExceptionRetry.
In this way the client can fail over to the other NameNode immediately instead of repeatedly
retrying the same NameNode.
Here, however, because dfs.client.retry.policy.enabled is set to true, MultipleLinearRandomRetry
is triggered inside the internal NameNode proxy, so we have to wait 10+ minutes. The exception
is thrown to RetryInvocationHandler only after all the retries of MultipleLinearRandomRetry
are exhausted.
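
For a rough sense of where the 10+ minutes comes from, the following Python sketch adds up the
nominal sleep time of a MultipleLinearRandomRetry-style policy. It assumes the retry spec is
"10000,6,60000,10" (pairs of sleep-milliseconds and retry-count), which is believed to be the
Hadoop default for dfs.client.retry.policy.spec but is not stated in this review. The function
names are hypothetical, and the model ignores randomized sleeps, connect timeouts and failover
backoff, so the result is only a nominal figure.

```python
# Assumed value of dfs.client.retry.policy.spec (not given in this review):
# pairs of (sleep time in ms, number of retries).
ASSUMED_RETRY_SPEC = "10000,6,60000,10"


def parse_retry_spec(spec):
    """Parse 't1,n1,t2,n2,...' into a list of (sleep_ms, retries) pairs."""
    numbers = [int(part) for part in spec.split(",")]
    if len(numbers) % 2 != 0:
        raise ValueError("retry spec must contain (sleep_ms, retries) pairs")
    return list(zip(numbers[0::2], numbers[1::2]))


def nominal_wait_seconds(spec):
    """Total nominal sleep time, in seconds, if every retry in the spec is used."""
    total_ms = sum(sleep_ms * retries for sleep_ms, retries in parse_retry_spec(spec))
    return total_ms / 1000.0


def seconds_before_failover(inner_retry_enabled, spec=ASSUMED_RETRY_SPEC):
    """Nominal time spent retrying a dead NameNode before the failover layer sees the error.

    Simplified model: with the inner retry disabled, the exception reaches the
    failover layer right away; with it enabled, every inner retry runs first.
    """
    return nominal_wait_seconds(spec) if inner_retry_enabled else 0.0


if __name__ == "__main__":
    for enabled in (False, True):
        minutes = seconds_before_failover(enabled) / 60.0
        print("dfs.client.retry.policy.enabled=%s: ~%.0f min before failover"
              % (str(enabled).lower(), minutes))
```

Under that assumed spec, the inner retries add up to 6 x 10 s + 10 x 60 s = 660 s, or about
11 minutes, which is in line with the delay described above.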


  ambari-server/src/main/java/org/apache/ambari/server/checks/ 5e029f4




Unit tests passed.

Total run: 761
Total errors: 0
Total failures: 0

I deployed my changes to a brand new cluster and it correctly set the hdfs-site property dfs.client.retry.policy.enabled
to false.
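
A check along these lines could confirm the deployed value. This is a minimal Python sketch,
not the procedure used for this review: it parses an inline sample in the standard Hadoop
<configuration>/<property> layout, and the helper names and sample content are hypothetical.
It also assumes that an unset property behaves as false, which is believed to be the Hadoop
default.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of an hdfs-site.xml as it might look after this change.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.client.retry.policy.enabled</name>
    <value>false</value>
  </property>
</configuration>
"""


def get_hdfs_site_property(xml_text, name):
    """Return the value of the named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


def client_retry_enabled(xml_text):
    """True only if dfs.client.retry.policy.enabled is explicitly set to true."""
    value = get_hdfs_site_property(xml_text, "dfs.client.retry.policy.enabled")
    # Assumption: an unset property falls back to a default of false.
    return value is not None and value.lower() == "true"


if __name__ == "__main__":
    print("client retry enabled:", client_retry_enabled(SAMPLE_HDFS_SITE))
```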


Alejandro Fernandez
