hadoop-hdfs-issues mailing list archives

From "Hanisha Koneru (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13442) Ozone: Handle Datanode Registration failure
Date Tue, 17 Apr 2018 21:35:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441526#comment-16441526 ]

Hanisha Koneru commented on HDFS-13442:
---------------------------------------

Thanks for the review [~anu].

This patch only changes the case where we get _errorNodeNotPermitted_. This happens when
the node is able to contact the SCM but the SCM does not register the node.
{quote}if the data nodes boot up earlier than SCM we would not want the data nodes to go silent
after 10 tries
{quote}
In this case, the datanode keeps retrying because the EndPointTask state remains {{HEARTBEAT}}.
In the code snippet below, if the datanode does not get a response from SCM, it catches the
exception and logs it if needed.
{code:java}
    try {
      SCMRegisteredCmdResponseProto response = rpcEndPoint.getEndPoint()
          .register(datanodeDetails.getProtoBufMessage(),
              conf.getStrings(ScmConfigKeys.OZONE_SCM_NAMES));
      ...
      ...
      processResponse(response);
    } catch (IOException ex) {
      rpcEndPoint.logIfNeeded(ex);
    }
{code}
{quote}Also, in the case where we get the error errorNodeNotPermitted, should we shut down the
data node and create some kind of error record on SCM so we can get that info back from SCM?
I am also ok with the current approach where we let the system slowly time out.
{quote}
I think we should let the DN make a few retries before shutting it down.
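
To make the two cases concrete, here is a minimal sketch of the behaviour discussed above. It is
not code from the patch: {{RegistrationRetrySketch}}, {{ScmClient}}, {{Reply}}, {{Outcome}} and
{{maxRejections}} are all hypothetical names chosen for illustration. It assumes that an
unreachable SCM surfaces as an IOException and is retried without limit, while a rejection
(standing in for _errorNodeNotPermitted_) is counted and the datanode gives up after a
configurable number of such replies.

```java
import java.io.IOException;

/** Hypothetical sketch; none of these names come from the Ozone code base. */
final class RegistrationRetrySketch {

  /** Result of a single registration attempt (illustrative only). */
  public enum Reply { SUCCESS, NODE_NOT_PERMITTED }

  /** Final result of the retry loop (illustrative only). */
  public enum Outcome { REGISTERED, GAVE_UP }

  /** Stand-in for the RPC call to SCM; throws if SCM cannot be reached. */
  public interface ScmClient {
    Reply register() throws IOException;
  }

  /**
   * Retries registration until it succeeds or SCM has rejected the node
   * maxRejections times. Connection failures do not count against the limit.
   * A real implementation would wait between attempts; that is omitted here.
   */
  public static Outcome registerWithRetries(ScmClient scm, int maxRejections) {
    int rejections = 0;
    while (true) {
      try {
        Reply reply = scm.register();
        if (reply == Reply.SUCCESS) {
          return Outcome.REGISTERED;
        }
        // SCM answered but refused the node: count it and maybe give up.
        rejections++;
        if (rejections >= maxRejections) {
          return Outcome.GAVE_UP;
        }
      } catch (IOException ex) {
        // SCM not reachable (for example, it has not started yet):
        // keep retrying and leave the rejection counter untouched.
      }
    }
  }

  public static void main(String[] args) {
    // An SCM that always rejects the node: the loop stops after 3 rejections.
    Outcome outcome = registerWithRetries(() -> Reply.NODE_NOT_PERMITTED, 3);
    System.out.println(outcome); // prints GAVE_UP
  }
}
```

With this split, a datanode that boots before SCM never goes silent, and only explicit
rejections consume the retry budget.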

> Ozone: Handle Datanode Registration failure
> -------------------------------------------
>
>                 Key: HDFS-13442
>                 URL: https://issues.apache.org/jira/browse/HDFS-13442
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>    Affects Versions: HDFS-7240
>            Reporter: Hanisha Koneru
>            Assignee: Hanisha Koneru
>            Priority: Major
>         Attachments: HDFS-13442-HDFS-7240.001.patch
>
>
> If a datanode is not able to register itself, we need to handle that correctly. 
> If the number of unsuccessful attempts to register with the SCM exceeds a configurable
> max number, the datanode should not make any more attempts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

