hadoop-hdfs-issues mailing list archives

From "Arpit Gupta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5399) Revisit SafeModeException and corresponding retry policies
Date Thu, 30 Jan 2014 03:00:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886208#comment-13886208 ]

Arpit Gupta commented on HDFS-5399:
-----------------------------------

bq. The test you observed this issue in didn't run long enough for the standby NN to leave
startup safemode on its own before the failover was attempted. The NN will delay processing
block reports for block IDs it doesn't recognize (because they're created in edits that the
NN hasn't read yet) and then only on transition to active do we fully catch up by reading
all the edits, and then re-process the delayed block reports, triggering the NN to leave startup
safemode.

It's not the test that directly fails. We see exceptions in the RM when it's trying to talk
to HDFS, or in the RS when it's trying to talk to HDFS, which causes the actual MR job etc. to fail.
So it's not something that the test can control. For example, we are running an MR job and
periodically killing the active NN, and the job eventually fails because the tasks that want to
talk to HDFS fail, or the RM runs into this exception, causing the application to fail. Hence
I would argue that it's a flaw in the test :).


> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>
>                 Key: HDFS-5399
>                 URL: https://issues.apache.org/jira/browse/HDFS-5399
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently for NN SafeMode, we have the following corresponding retry policies:
> # In a non-HA setup, for certain API calls ("create"), the client will retry if the NN is
> in SafeMode. Specifically, the client side's RPC adopts the MultipleLinearRandomRetry policy for
> a wrapped SafeModeException when retry is enabled.
> # In an HA setup, the client will retry if the NN is Active and in SafeMode. Specifically,
> the SafeModeException is wrapped as a RetriableException on the server side. The client side's
> RPC uses the FailoverOnNetworkExceptionRetry policy, which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by an administrator through
> the CLI), and the clients may not want to retry on this type of SafeMode.
> # Clients may want to retry on other API calls in a non-HA setup.
> # We should have a single generic strategy to address the mapping between SafeMode and
> retry policy for both HA and non-HA setups. A possible straightforward solution is to always
> wrap the SafeModeException in the RetriableException to indicate that the clients should retry.
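
The proposal in the description above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Hadoop code: the classes SafeModeException and RetriableException here are hypothetical stand-ins that only borrow the names used in the issue, and the "manual" flag, toClientException, shouldRetry, and the retry limit are assumptions made for the example. It also assumes one possible way to address the "Manual" SafeMode concern, namely leaving manual safe mode unwrapped so that clients do not retry on it.

```java
import java.io.IOException;

public class SafeModeRetrySketch {
    /** Hypothetical stand-in for the NameNode's SafeModeException. */
    static class SafeModeException extends IOException {
        // Assumption: true if an administrator entered safe mode through the CLI.
        final boolean manual;

        SafeModeException(String message, boolean manual) {
            super(message);
            this.manual = manual;
        }
    }

    /** Hypothetical stand-in for RetriableException: marks a failure as worth retrying. */
    static class RetriableException extends IOException {
        RetriableException(IOException cause) {
            super(cause.getMessage(), cause);
        }
    }

    enum Action { RETRY, FAIL }

    /** Server side: wrap startup safe mode so clients retry; leave manual safe mode unwrapped. */
    static IOException toClientException(SafeModeException e) {
        return e.manual ? e : new RetriableException(e);
    }

    /** Client side, same for HA and non-HA: retry only on RetriableException, up to maxRetries. */
    static Action shouldRetry(IOException e, int retriesSoFar, int maxRetries) {
        boolean retriable = e instanceof RetriableException;
        return (retriable && retriesSoFar < maxRetries) ? Action.RETRY : Action.FAIL;
    }

    public static void main(String[] args) {
        IOException startup = toClientException(new SafeModeException("NN in startup safe mode", false));
        IOException manual = toClientException(new SafeModeException("NN in manual safe mode", true));
        System.out.println("startup safe mode: " + shouldRetry(startup, 0, 3));
        System.out.println("manual safe mode:  " + shouldRetry(manual, 0, 3));
        System.out.println("retries exhausted: " + shouldRetry(startup, 3, 3));
    }
}
```

With this sketch the client-side decision depends only on whether the server chose to wrap the exception, so the same check would apply to any API call in both HA and non-HA setups.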



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
