hadoop-hdfs-issues mailing list archives

From "Weiwei Yang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
Date Fri, 07 Jul 2017 15:12:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078213#comment-16078213 ]

Weiwei Yang edited comment on HDFS-12098 at 7/7/17 3:11 PM:
------------------------------------------------------------

This is because the datanode state machine leaks {{VersionEndpointTask}} threads. When scm
is not yet started, more and more {{VersionEndpointTask}} threads keep retrying the
connection to scm:

{noformat}
INIT - RUNNING 
                 \
                GETVERSION
                     new VersionEndpointTask submitted - retrying ...
                               ... (HB interval)
                     new VersionEndpointTask submitted - retrying ...
                               ... (HB interval)
                     new VersionEndpointTask submitted - retrying ...
                               ...
{noformat}

The version endpoint tasks are launched once per HB interval (5s in my env), so a new task
is submitted every 5s. The retry policy for each getVersion call is 10 * 1s = 10s, so a task
can finish only every 10s. Two tasks are submitted for every one that finishes, so every 10s
there is ONE leaked thread.
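
The accounting above can be written out as a small calculation. This is only a sketch that takes the numbers in this comment at face value (one submission per HB interval, one completion per retry window); the class and method names are hypothetical and are not part of the Ozone code base.

```java
// Hypothetical model of the accounting in this comment, not Ozone code.
final class VersionTaskBacklog {

    /**
     * Tasks still outstanding after elapsedSec seconds, assuming one task is
     * submitted every hbIntervalSec and one task finishes every retryWindowSec.
     */
    static long backlogAfter(long elapsedSec, long hbIntervalSec, long retryWindowSec) {
        long submitted = elapsedSec / hbIntervalSec;   // one new task per HB interval
        long finished = elapsedSec / retryWindowSec;   // one task gives up per retry window
        return submitted - finished;
    }

    public static void main(String[] args) {
        // 5s HB interval and 10 * 1s retry window, as observed in this comment.
        for (long t = 10; t <= 60; t += 10) {
            System.out.println(t + "s: " + backlogAfter(t, 5, 10) + " task(s) outstanding");
        }
    }
}
```

With these inputs the backlog grows by one for every 10 seconds that scm stays down.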

When scm comes up, all pending tasks are able to connect and their getVersion calls return,
so each of them advances the state to the next one. Because the state is shared in
{{EndpointStateMachine}}, it is advanced more than once, so when I review the state changes
they look like this:

{noformat}
REGISTER
HEARTBEAT
SHUTDOWN
SHUTDOWN
SHUTDOWN
... 
{noformat}
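
The overshoot can be illustrated with a minimal sketch in which several completed tasks each advance one shared state. The state names come from the log above; the class, the method names, and the assumption that the state stops advancing once it reaches SHUTDOWN are hypothetical and only meant to mirror the observed output, not the real {{EndpointStateMachine}}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a state shared by all version tasks, not Ozone code.
final class SharedEndpointStateSketch {

    enum State { GETVERSION, REGISTER, HEARTBEAT, SHUTDOWN }

    private State state = State.GETVERSION;

    /** Moves the shared state one step forward; assumed to stay at SHUTDOWN once reached. */
    synchronized State advance() {
        State[] all = State.values();
        state = all[Math.min(state.ordinal() + 1, all.length - 1)];
        return state;
    }

    /** Each pending task that finally gets a getVersion reply advances the same state. */
    static List<State> completeTasks(int pendingTasks) {
        SharedEndpointStateSketch shared = new SharedEndpointStateSketch();
        List<State> seen = new ArrayList<>();
        for (int i = 0; i < pendingTasks; i++) {
            seen.add(shared.advance());
        }
        return seen;
    }

    public static void main(String[] args) {
        // One task would stop at REGISTER; five leaked tasks push the state to SHUTDOWN.
        System.out.println(completeTasks(1));
        System.out.println(completeTasks(5));
    }
}
```

In this sketch a single completed task leaves the state at REGISTER, while five leaked tasks completing together reproduce the REGISTER, HEARTBEAT, SHUTDOWN, SHUTDOWN, SHUTDOWN sequence.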


was (Author: cheersyang):
This is because datanode state machine leaks {{VersionEndpointTask}} thread. In the case scm
is not yet started,
 more and more {{VersionEndpointTask}} threads keep retrying connection with scm,

{noformat}
INIT - RUNNING 
                 \
                GETVERSION
                       executor.execute(new VersionEndpointTask()) - retry on getVersion ...
                               ... (HB interval)
                       executor.execute(new VersionEndpointTask()) - retry on getVersion ...
                               ... (HB interval)
                       executor.execute(new VersionEndpointTask()) - retry on getVersion ...
                               ...
{noformat}

the version endpoint tasks are launched in HB interval (5s on my env), so every 5s there is
a new task submitted; the retry policy for each getVersion call is 10 * 1s = 10s, so every
10s a task can be finished. So every 10s there will be ONE thread leak.

When scm is up, all pending tasks will be able to connect to scm and getVersion call returns,
so each of them will count the state to next, since the state is shared in {{EndpointStateMachine}},
it increments more than 1 so when I review the state changes, it looks like below

{noformat}
REGISTER
HEARTBEAT
SHUTDOWN
SHUTDOWN
SHUTDOWN
... 
{noformat}

> Ozone: Datanode is unable to register with scm if scm starts later
> ------------------------------------------------------------------
>
>                 Key: HDFS-12098
>                 URL: https://issues.apache.org/jira/browse/HDFS-12098
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, ozone, scm
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Critical
>         Attachments: thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and check the datanode state; it has connection issues, which is expected
> # Start SCM, expecting the datanode to connect to the scm and the state machine to
> transition to RUNNING. Instead, its state transitions to SHUTDOWN and the datanode enters
> chill mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

