hadoop-hdfs-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7521) Refactor DN state management
Date Sat, 13 Dec 2014 02:28:13 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7521:
--------------------------
    Description: 
There are two aspects to DN state management in the NN.

* State machine management within the active NN
The NN maintains the state of each data node, such as whether it is running or being decommissioned.
But the state machine isn’t well defined, and we have dealt with some corner-case bugs in this
area. It would be useful to refactor the code to use a clear state machine definition
that defines the events, the available states, and the actions for state transitions. This has these benefits:
** Make it easy to define correctness of DN state management. Currently some of the state
transitions aren't defined in the code. For example, if admins remove a node from the include
host file while the node is being decommissioned, it will be transitioned to DEAD and DECOMM_IN_PROGRESS.
That might not be the intention. With a state machine definition, we can identify this
case.
** Make it easy to add new DN states later. For example, people have discussed a new “maintenance”
state for DNs to support the scenario where admins need to take a machine or rack down for 30
minutes for repair.

We can refactor DN state management with a clear state machine definition based on YARN's state-related components.
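
As a rough illustration of what an explicit definition buys us, here is a minimal, self-contained sketch of a transition table. It does not use the YARN state components; the class name, the State and Event enums, and the particular transitions are all hypothetical, and it collapses liveness and admin state into a single enum for brevity. The point is only that a pair such as (DECOMM_IN_PROGRESS, REMOVED_FROM_INCLUDE_FILE) with no entry shows up as an explicit "undefined" result instead of an accidental combination of states.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: these states, events, and transitions are illustrative
// assumptions, not the enums or rules used by HDFS or YARN.
final class DnStateMachineSketch {
    enum State { LIVE, DECOMM_IN_PROGRESS, DECOMMISSIONED, DEAD }

    enum Event {
        START_DECOMMISSION, REPLICATION_DONE, HEARTBEAT_EXPIRED,
        REMOVED_FROM_INCLUDE_FILE, REGISTER
    }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
            new EnumMap<State, Map<Event, State>>(State.class);

    private static void allow(State from, Event on, State to) {
        TRANSITIONS
                .computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
                .put(on, to);
    }

    static {
        allow(State.LIVE, Event.START_DECOMMISSION, State.DECOMM_IN_PROGRESS);
        allow(State.LIVE, Event.HEARTBEAT_EXPIRED, State.DEAD);
        allow(State.LIVE, Event.REMOVED_FROM_INCLUDE_FILE, State.DEAD);
        allow(State.DECOMM_IN_PROGRESS, Event.REPLICATION_DONE, State.DECOMMISSIONED);
        allow(State.DECOMM_IN_PROGRESS, Event.HEARTBEAT_EXPIRED, State.DEAD);
        allow(State.DEAD, Event.REGISTER, State.LIVE);
        // Deliberately no entry for (DECOMM_IN_PROGRESS, REMOVED_FROM_INCLUDE_FILE):
        // the table makes the missing decision visible.
    }

    // Returns the next state, or empty if the transition is not defined.
    static Optional<State> next(State from, Event on) {
        Map<Event, State> row = TRANSITIONS.get(from);
        return row == null ? Optional.empty() : Optional.ofNullable(row.get(on));
    }

    public static void main(String[] args) {
        System.out.println(next(State.LIVE, Event.START_DECOMMISSION));
        System.out.println(next(State.DECOMM_IN_PROGRESS, Event.REMOVED_FROM_INCLUDE_FILE));
    }
}
```

Adding a new state such as "maintenance" would then amount to one new enum constant plus the table entries that reach and leave it.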



* State machine consistency between active and standby NN
Another dimension of state machine management is consistency across NN pairs. We have dealt
with bugs caused by the active NN and the standby NN seeing different sets of live nodes. The current design is
to have each NN manage its own state based on the events it receives. For example, DNs
send heartbeats to both NNs; admins issue decommission commands to both NNs. An alternative
design approach could be to have ZK manage the state.
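
To make the consistency concern concrete, here is a small in-memory sketch contrasting the two approaches. Everything in it is an assumption made for illustration: NameNodeView and SharedStateStore are hypothetical classes, the state names are plain strings, and SharedStateStore is only a stand-in for a coordination service such as ZK (it does not use the ZooKeeper API and ignores latency and failure handling).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not HDFS code and not the ZooKeeper API.
final class NnStateConsistencySketch {

    // Current design: each NN derives DN state from the events it receives.
    static final class NameNodeView {
        private final Map<String, String> dnState = new HashMap<>();

        void onHeartbeat(String dn) {
            dnState.putIfAbsent(dn, "LIVE");
        }

        void onDecommissionCommand(String dn) {
            dnState.put(dn, "DECOMM_IN_PROGRESS");
        }

        String stateOf(String dn) {
            return dnState.get(dn);
        }
    }

    // Alternative: one shared store that both NNs read from.
    static final class SharedStateStore {
        private final Map<String, String> dnState = new HashMap<>();

        void put(String dn, String state) {
            dnState.put(dn, state);
        }

        String get(String dn) {
            return dnState.get(dn);
        }
    }

    // A decommission command that reaches only the active NN leaves the views different.
    static boolean independentViewsDiverge() {
        NameNodeView active = new NameNodeView();
        NameNodeView standby = new NameNodeView();
        active.onHeartbeat("dn1");
        standby.onHeartbeat("dn1");
        active.onDecommissionCommand("dn1"); // the standby never sees this event
        return !active.stateOf("dn1").equals(standby.stateOf("dn1"));
    }

    // With a single shared store, both NNs read the same value.
    static boolean sharedStoreAgrees() {
        SharedStateStore store = new SharedStateStore();
        store.put("dn1", "LIVE");
        store.put("dn1", "DECOMM_IN_PROGRESS");
        String seenByActive = store.get("dn1");
        String seenByStandby = store.get("dn1");
        return seenByActive.equals(seenByStandby);
    }

    public static void main(String[] args) {
        System.out.println("independent views diverge: " + independentViewsDiverge());
        System.out.println("shared store agrees: " + sharedStoreAgrees());
    }
}
```

The sketch only shows how a missed event leads to divergence; whether a shared store is worth the extra dependency and write latency is part of the question being raised here.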

Thoughts?

  was:
There are two aspects to DN state management in the NN.

* State machine management within the active NN
The NN maintains the state of each data node, such as whether it is running or being decommissioned.
But the state machine isn’t well defined, and we have dealt with some corner-case bugs in this
area. It would be useful to refactor the code to use a clear state machine definition
that defines the events, the available states, and the actions for state transitions. This has these benefits:

** Make it easy to define correctness of DN state management. Currently some of the state
transitions aren't defined in the code. For example, if admins remove a node from the include
host file while the node is being decommissioned, it will be transitioned to DEAD and DECOMM_IN_PROGRESS.
That might not be the intention. With a state machine definition, we can identify this
case.

** Make it easy to add new DN states later. For example, people have discussed a new “maintenance”
state for DNs to support the scenario where admins need to take a machine or rack down for 30
minutes for repair.

We can refactor DN state management with a clear state machine definition based on YARN's state-related components.


* State machine consistency between active and standby NN

Another dimension of state machine management is consistency across NN pairs. We have dealt
with bugs caused by the active NN and the standby NN seeing different sets of live nodes. The current design is
to have each NN manage its own state based on the events it receives. For example, DNs
send heartbeats to both NNs; admins issue decommission commands to both NNs. An alternative
design approach we discussed is to have ZK manage the state.

Thoughts?


> Refactor DN state management
> ----------------------------
>
>                 Key: HDFS-7521
>                 URL: https://issues.apache.org/jira/browse/HDFS-7521
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
