From GitBox <...@apache.org>
Subject [GitHub] [hadoop] elek commented on a change in pull request #1196: HDDS-1881. Design doc: decommissioning in Ozone
Date Thu, 01 Aug 2019 10:58:02 GMT
elek commented on a change in pull request #1196: HDDS-1881. Design doc: decommissioning in Ozone
URL: https://github.com/apache/hadoop/pull/1196#discussion_r309639099

 File path: hadoop-hdds/docs/content/design/decommissioning.md
 @@ -0,0 +1,720 @@
+title: Decommissioning in Ozone
+summary: Formal process to shut down machines in a safe way after the required replications.
+date: 2019-07-31
+jira: HDDS-1881
+status: current
+author: Anu Engineer, Marton Elek, Stephen O'Donnell 
+# Abstract 
+The goal of decommissioning is to turn off a selected set of machines without data loss.
It may or may not require moving the existing replicas of the containers to other nodes.
+There are two main classes of decommissioning:
+ * __Maintenance mode__: where the node is expected to be back after a while. It may not
require replication of containers if enough replicas are available from other nodes (as we
expect to have the current replicas after the restart.)
+ * __Decommissioning__: where the node won't be started again. All the data should be replicated
according to the current replication rules.
+ * Decommissioning can be canceled any time
+ * The progress of the decommissioning should be trackable
+ * The nodes under decommissioning / maintenance mode should not be used for new pipelines
/ containers
+ * The state of the datanodes should be persisted / replicated by the SCM (in HDFS the decommissioning
info exclude/include lists are replicated manually by the admin). If a datanode is marked for
decommissioning, this state should be available after SCM and/or Datanode restarts.
+ * We need to support validations before decommissioning (but the violations can be ignored
by the admin).
+ * The administrator should be notified when a node can be turned off.
+ * The maintenance mode can be time constrained: if the node is marked for maintenance for 1
week and the node is not up after one week, the containers should be considered as lost (DEAD
node) and should be replicated.
+# Introduction
+Ozone is a highly available file system that relies on commodity hardware. In other words,
Ozone is designed to handle failures of these nodes all the time.
+The Storage Container Manager (SCM) is designed to monitor the node health and replicate blocks
and containers as needed.
+At times, Operators of the cluster can help the SCM by giving it hints. When removing a datanode,
the operator can provide a hint. That is, a planned failure of the node is coming up, and
SCM can make sure it reaches a safe state to handle this planned failure.
+Sometimes, this failure is transient; that is, the operator is taking down this node temporarily.
In that case, we can live with lower replica counts by being optimistic.
+Both of these operations, __Maintenance__ and __Decommissioning__, are similar from the Replication
point of view. In both cases, the user instructs us on how to handle an upcoming failure.
+Today, SCM (*Replication Manager* component inside SCM) understands only one form of failure
handling. This paper extends Replica Manager failure modes to allow users to request which
failure handling model should be adopted (Optimistic or Pessimistic).
+Based on physical realities, there are two responses to any perceived failure: to heal the
system by taking corrective actions, or to ignore the failure since the actions in the future
will heal the system automatically.
+## User Experiences (Decommissioning vs Maintenance mode)
+From the user's point of view, there are two kinds of planned failures that the user would
like to communicate to Ozone.
+The first kind is when a 'real' failure is going to happen in the future. This 'real' failure
is the act of decommissioning. We denote this as "decommission" throughout this paper. The
response that the user wants is for SCM/Ozone to make replicas to deal with the planned failure.
+The second kind is when the failure is 'transient.' The user knows that this failure is temporary
and the cluster in most cases can safely ignore this issue. However, if the transient failures
are going to cause a failure of availability, then the user would like Ozone to take appropriate
actions to address it. An example of this case is if the user put 3 data nodes into maintenance
mode and switched them off.
+The transient failure can violate the availability guarantees of Ozone, since the user is
telling us not to take corrective actions. Many times, the user does not understand the impact
on availability while asking Ozone to ignore the failure.
+So this paper proposes the following definitions for Decommission and Maintenance of data
nodes.
+__Decommission__ of a data node is deemed to be complete when SCM/Ozone completes the replication
of all containers on the decommissioned data node to other data nodes. That is, the expected count
matches the healthy count of containers in the cluster.
+__Maintenance mode__ of a data node is complete if Ozone can guarantee at least one copy
of every container is available in other healthy data nodes.
+## Examples 
+Here are some illustrative examples:
+1.  Let us say we have a container, which has only one copy and resides on Machine A. If
the user wants to put Machine A into maintenance mode, Ozone will make a replica before entering
the maintenance mode.
+2. Suppose a container has two copies, and the user wants to put Machine A into maintenance
mode. In this case, Ozone understands that availability of the container is not affected
and hence can decide to forgo replication.
+3. Suppose a container has two copies, and the user wants to put Machine A into maintenance
mode. However, the user wants to put the machine into maintenance mode for one month. As the
period of maintenance mode increases, the probability of data loss increases; hence, Ozone
might choose to make a replica of the container even if we are entering maintenance mode.
+4. The semantics of decommissioning means that as long as we can find copies of containers
in other machines, we can technically get away with calling decommission complete. Hence this
clarification note: in the ordinary course of action, each decommission will create a replication
flow for each container we have; however, it is possible to complete a decommission of a data
node even if we get a failure of the data node being decommissioned. As long as we can find
the other datanodes to replicate from and get the number of replicas needed back up to the expected
count, we are good.
+5. Let us say we have a copy of a container replica on Machine A, B, and C. It is possible
to decommission all three machines at the same time, as decommissioning is just a status indicator
of the data node until we finish the decommissioning process.
+The user-visible features for both of these are very similar:
+Both Decommission and Maintenance mode can be canceled any time before the operation is marked
as completed by SCM.
+Decommissioned nodes, if and when added back, shall be treated as new data nodes; if they
have blocks or containers on them, they can be used to reconstruct data.
+## Maintenance mode in HDFS
+HDFS supports decommissioning and maintenance mode similar to Ozone. This is a quick description
of the HDFS approach.
+The usage of HDFS maintenance mode:
+  * First, you set a minimum replica count on the cluster, which can be zero, but defaults
to 1.
+  * Then you can set a number of nodes into maintenance, with an expiry time or have them
remain in maintenance forever, until they are manually removed. Nodes are put into maintenance
in much the same way as nodes are decommissioned.
+  * When a set of nodes go into maintenance, all blocks hosted on them are scanned and if
the node going into maintenance would cause the number of replicas to fall below the minimum
replica count, the relevant nodes go into a decommissioning-like state while new replicas
are made for the blocks.
+  * Once the node goes into maintenance, it can be stopped etc. and HDFS will not be concerned
about the under-replicated state of the blocks.
+  * When the expiry time passes, the node is put back to normal state (if it is online and
heartbeating) or marked as dead, at which time new replicas will start to be made.
+This is very similar to decommissioning, and the code to track maintenance mode and ensure
the blocks are replicated etc. is effectively the same code as with decommissioning. The one
area that differs is probably in the replication monitor as it must understand that the node
is expected to be offline.
+The ideal way to use maintenance mode is when you know there is a set of nodes you can
stop without having to do any replications. In HDFS, the rack awareness states that all blocks
should be on two racks, so that means a rack can be put into maintenance safely.
+There is another feature in HDFS called "upgrade Domain" which allows each datanode to be
assigned a group. By default there should be at least 3 groups (domains) and then each of
the 3 replicas will be stored on a different group, allowing one full group to be put into maintenance
at once. That is not yet supported in CDH, but is something we are targeting for CDPD I believe.
+One other difference between maintenance mode and decommissioning is that you must have some
sort of monitor thread checking for when maintenance is scheduled to end. HDFS solves this
by having a class called the DatanodeAdminManager, and it tracks all nodes transitioning state,
the under-replicated block count on them etc.
+# Implementation
+## Datanode state machine
+`NodeStateManager` maintains the state of the connected datanodes. The possible states:
+  state                | description
+  ---------------------|------------
+  HEALTHY              | The node is up and running.
+  STALE                | Some heartbeats have been missed from the node.
+  DEAD                 | The stale node has not been recovered.
+  ENTERING_MAINTENANCE | The in-progress state, scheduling is disabled but the node cannot be turned off yet due to in-progress replication.
+  IN_MAINTENANCE       | The node can be turned off but we expect to get it back with all the replicas.
+  DECOMMISSIONING      | The in-progress state, scheduling is disabled, all the containers should be replicated to other nodes.
+  DECOMMISSIONED       | The node can be turned off, all the containers are replicated to other nodes.
+## High level algorithm
+The algorithm is pretty simple from the Decommission or Maintenance point of view:
+ 1. Mark a data node as DECOMMISSIONING or ENTERING_MAINTENANCE. This implies that the node is
NOT healthy anymore; we assume the use of a single flag and law of excluded middle.
+ 2. Shut down the pipelines and wait for confirmation that all pipelines are shut down.
So no new I/O or container creation can happen on a Datanode that is part of decomm/maint.
+ 3. Once the Node has been marked as DECOMMISSIONING or ENTERING_MAINTENANCE, the Node will
generate a list of containers that need replication. This list is generated by the Replica
Count decisions for each container; the Replica Count will be computed by Replica Manager.
+ 4. Once the Replica Count for a container goes back to zero, which means that we have
finished with the pending replications, the container is removed from this wait list.
+ 5. Once the size of the waitlist reaches zero, maintenance mode or decommission is complete.
+ 6. We will update the node state to DECOMMISSIONED or IN_MAINTENANCE.
+_Replica count_ is a calculated number which represents the number of _missing_ replicas.
The number can be negative in case of an over-replicated container.
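
The wait-list steps above can be illustrated with a minimal Java sketch. It is not taken from the design doc or from SCM: the class name, the `advance` method, the trimmed-down `NodeState` enum and the `Map` used as wait list are all hypothetical. It also assumes that a negative replica count (over-replication) counts as "done" in the same way as zero.

```java
import java.util.HashMap;
import java.util.Map;

final class DecommissionWaitlistSketch {

    // Hypothetical subset of the node states, only for this illustration.
    enum NodeState { DECOMMISSIONING, DECOMMISSIONED, ENTERING_MAINTENANCE, IN_MAINTENANCE }

    /**
     * Removes every container whose replica count (number of missing replicas)
     * is zero or negative from the wait list, then returns the node state that
     * follows: the final state once the wait list is empty, otherwise the
     * unchanged in-progress state.
     */
    static NodeState advance(NodeState current, Map<String, Integer> waitlist) {
        // Assumption: over-replicated containers (negative count) are also finished.
        waitlist.values().removeIf(missing -> missing <= 0);
        if (!waitlist.isEmpty()) {
            return current;
        }
        switch (current) {
            case DECOMMISSIONING:
                return NodeState.DECOMMISSIONED;
            case ENTERING_MAINTENANCE:
                return NodeState.IN_MAINTENANCE;
            default:
                return current;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> waitlist = new HashMap<>();
        waitlist.put("container-1", 1); // one replica still missing
        waitlist.put("container-2", 0); // replication already finished
        System.out.println(advance(NodeState.DECOMMISSIONING, waitlist));
        waitlist.put("container-1", 0); // pending replication completed
        System.out.println(advance(NodeState.DECOMMISSIONING, waitlist));
    }
}
```

In this sketch the first call leaves the node in DECOMMISSIONING because one container is still waiting, and the second call moves it to DECOMMISSIONED.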
+## Calculation of the _Replica count_ (required replicas)
+### Counters / Variables
+We have 7 different datanode states and three different types of container state (replicated
or in-flight deletion / in-flight replication). To calculate the required replicas we should
introduce a few variables.
+Note: we don't need to use all the possible counters but the following table summarizes how
the counters are calculated for the following algorithm.
+For example the `maintenance` variable includes the number of the existing replicas on ENTERING_MAINTENANCE
and IN_MAINTENANCE nodes.
+Each counter should be calculated on a per-container basis.
+   Node state                            | Containers - in-flight deletion | In-Flight
+   --------------------------------------|---------------------------------|------------
+   HEALTHY                               | `healthy`                       | `inFlight`
+   STALE + DEAD + DECOMMISSIONED         |                                 |
+   DECOMMISSIONING                       |                                 |
+   ENTERING_MAINTENANCE + IN_MAINTENANCE | `maintenance`                   |
+### The current replication model
+The current replication model in SCM/Ozone is very simplistic. We compute the replication
count or the number of replications that we need to do as:
+Replica count = expectedCount - currentCount
+In case the _Replica count_ is positive, it means that we need to make more replicas. If
the number is negative, it means that we are over replicated and we need to remove some replicas
of this container. If the Replica count for a container is zero, it means that we have the
expected number of replicas in the cluster.
+To support idempotent placement strategies we include the in-flight replications in the `currentCount`:
if there is one in-flight replication process and two replicas, we won't start a new replication
command unless the original command has timed out.
+The timeout is configured with `hdds.scm.replication.event.timeout` and the default value
is 10 minutes.
+More precisely, the current algorithm is the following:
+Replica count = expectedCount - healthy - inFlight
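
As a small illustration, the current formula can be written as a Java sketch. The class and method names are hypothetical and are not the SCM implementation; only the arithmetic comes from the formula above.

```java
final class CurrentReplicaCountSketch {

    /**
     * Current model: number of missing replicas for one container.
     * Positive means under-replicated, negative means over-replicated.
     */
    static int replicaCount(int expectedCount, int healthy, int inFlight) {
        return expectedCount - healthy - inFlight;
    }

    public static void main(String[] args) {
        // Two replicas plus one in-flight replication: nothing more to schedule.
        System.out.println(replicaCount(3, 2, 1));
        // Two replicas and no in-flight replication: one replica is missing.
        System.out.println(replicaCount(3, 2, 0));
        // Four replicas: the container is over-replicated by one.
        System.out.println(replicaCount(3, 4, 0));
    }
}
```

The first case corresponds to the situation described above: with one in-flight replication and two existing replicas, no new replication command is started.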
+### The proposed solution
+To support the notion that a user can provide hints to the replication model, we propose
to add two variables to the current model.
+In the new model, we propose to break the `currentCount` into three separate groups.
That is _Healthy nodes_, _Maintenance nodes_, and _Decommission nodes_. The new model replaces
the currentCount with these three separate counts. The following function captures the code
that drives the logic of computing Replica counts in the new model. The table below discusses
the input and output of this model very extensively.
+/**
+ * Calculate the number of the missing replicas.
+ *
+ * @return the number of the missing replicas. If it's less than zero, the container is over
+ * replicated.
+ */
+int getReplicationCount(int expectedCount, int healthy,
+   int maintenance, int inFlight) {
+   // For over replication, count only the healthy replicas.
+   if (expectedCount < healthy) {
+      return expectedCount - healthy;
+   }
+   int replicaCount = expectedCount - (healthy + maintenance + inFlight);
+   // Require one more replica if the total looks sufficient but none is healthy.
+   if (replicaCount == 0 && healthy < 1) {
+      replicaCount++;
+   }
+   // Over replication is already handled above.
+   return Math.max(0, replicaCount);
+}
+We also need to specify the two end conditions under which the DECOMMISSIONING node can be moved to the
DECOMMISSIONED state or the ENTERING_MAINTENANCE mode can be moved to the IN_MAINTENANCE state.
+The following conditions should be true for all the containers and all the containers on
the specific node should be closed.
+ * There is at least one healthy replica
+ * There is at most one missing replica
+Which means that node can be decommissioned if:
+ * all the containers with replication factor THREE have at least *one replica* on a HEALTHY
node (minimum.live.replicas)
 Review comment:
   I tried to make this stop condition clearer in the next version of the document. I agree
with you that we need 3 replicas on other nodes from all the containers but what about other
nodes in maintenance mode?
   This line would like to define that we need 3 replicas (defined in the next line) *AND*
at least one of the replicas should be served by a HEALTHY node.
   So the stop condition for decommissioning should be:
   (healthy >= 1) && (healthy + maintenance >= 3)
   (for maintenance mode the second part can be ignored)
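
   To make the condition concrete, here is a minimal Java sketch of it as two per-container predicates. The class and method names are hypothetical, and the literal 3 is generalized to an `expectedCount` parameter, which is an assumption on top of the expression above.

```java
final class StopConditionSketch {

    /**
     * Per-container stop condition for decommissioning:
     * (healthy >= 1) && (healthy + maintenance >= expectedCount).
     * expectedCount is 3 for replication factor THREE.
     */
    static boolean canFinishDecommission(int healthy, int maintenance, int expectedCount) {
        return healthy >= 1 && healthy + maintenance >= expectedCount;
    }

    /** For maintenance mode the second part is ignored. */
    static boolean canEnterMaintenance(int healthy) {
        return healthy >= 1;
    }

    public static void main(String[] args) {
        // One healthy replica and two on maintenance nodes.
        System.out.println(canFinishDecommission(1, 2, 3));
        // Three replicas exist, but none of them is on a HEALTHY node.
        System.out.println(canFinishDecommission(0, 3, 3));
        // Only two healthy replicas and none in maintenance.
        System.out.println(canFinishDecommission(2, 0, 3));
    }
}
```

   The node could leave the in-progress state only when the predicate holds for every container on it.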

