hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5399) Revisit SafeModeException and corresponding retry policies
Date Tue, 04 Feb 2014 02:30:10 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890296#comment-13890296 ]

Aaron T. Myers commented on HDFS-5399:

I wanted to make sure I understand this scenario. To me, this would happen if the current standby
namenode (nn2) was active before and was recently (a few seconds ago) killed and restarted, causing
it to be in safemode, and then the active (nn1) was killed at the same time, causing the client
to go to nn2 while it's still in safemode. Did I understand it right?

I don't believe we hit this scenario, as we restarted the active NN every 5 minutes. However, I
can see the need for client retries to make sure that, even during the above scenario, the DFSClient is
able to retry and wait for the NN to come out of safemode.

Yes, that sounds roughly like the scenario I was proposing might be the issue. I'm still not
convinced, however, that this is not roughly the problem. Am I correct in assuming that the
test you were running did not manually cause the NN to enter or leave safemode? If so, that
implies that somehow the NN was staying in startup safemode much longer than it should have.

[~jingzhao] - any updates on reproducing this? As it stands, without knowing more, I think
we should probably revert HDFS-5291 since I think the behavior it introduced is wrong here,
and it was apparently introduced to work around an issue that we don't fully understand yet.

> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>                 Key: HDFS-5399
>                 URL: https://issues.apache.org/jira/browse/HDFS-5399
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
> Currently for NN SafeMode, we have the following corresponding retry policies:
> # In a non-HA setup, for certain API calls ("create"), the client will retry if the NN is
> in SafeMode. Specifically, the client side's RPC adopts the MultipleLinearRandomRetry policy for
> a wrapped SafeModeException when retry is enabled.
> # In an HA setup, the client will retry if the NN is Active and in SafeMode. Specifically,
> the SafeModeException is wrapped as a RetriableException on the server side. The client side's
> RPC uses the FailoverOnNetworkExceptionRetry policy, which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by an administrator through the
> CLI), and the clients may not want to retry on this type of SafeMode.
> # The client may want to retry on other API calls in a non-HA setup.
> # We should have a single generic strategy to address the mapping between SafeMode and
> retry policy for both HA and non-HA setups. A possible straightforward solution is to always
> wrap the SafeModeException in the RetriableException to indicate that the clients should retry.
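
The single generic strategy suggested in the issue description could be sketched roughly as below. This is a minimal, self-contained Java illustration and not Hadoop code: the classes are hypothetical stand-ins for the exception types named in the issue, the `manual` flag and the choice not to wrap a manual SafeMode are assumptions drawn from the first listed concern, and `maxAttempts` is an illustrative parameter.

```java
import java.io.IOException;

// Hypothetical sketch; none of these types are the real Hadoop classes.
class SafeModeRetrySketch {

    // Stand-in for the NN's SafeModeException.
    static class SafeModeException extends IOException {
        final boolean manual; // assumed flag: true if an administrator entered SafeMode via the CLI

        SafeModeException(String message, boolean manual) {
            super(message);
            this.manual = manual;
        }
    }

    // Stand-in for the wrapper that tells clients a retry is worthwhile.
    static class RetriableException extends IOException {
        RetriableException(Throwable cause) {
            super(cause);
        }
    }

    enum Action { RETRY, FAIL }

    // Server side: wrap startup SafeMode errors as retriable; leave manual ones unwrapped.
    static IOException toWireException(SafeModeException e) {
        return e.manual ? e : new RetriableException(e);
    }

    // Client side: one rule for both HA and non-HA setups.
    static Action decide(IOException e, int attempt, int maxAttempts) {
        if (e instanceof RetriableException && attempt < maxAttempts) {
            return Action.RETRY;
        }
        return Action.FAIL;
    }

    public static void main(String[] args) {
        IOException startup = toWireException(new SafeModeException("startup safemode", false));
        IOException manual = toWireException(new SafeModeException("manual safemode", true));
        System.out.println("startup safemode, attempt 0: " + decide(startup, 0, 3));
        System.out.println("manual safemode, attempt 0: " + decide(manual, 0, 3));
    }
}
```

With this shape the client never inspects SafeModeException directly; whether a given SafeMode is worth waiting out is decided once, on the server side.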

This message was sent by Atlassian JIRA