From: "ASF GitHub Bot (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Wed, 31 Jul 2019 13:04:05 +0000 (UTC)
Subject: [jira] [Work logged] (HDDS-1881) Design doc: decommissioning in Ozone

[ https://issues.apache.org/jira/browse/HDDS-1881?focusedWorklogId=285768&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-285768 ]

ASF GitHub Bot logged work on HDDS-1881:
----------------------------------------

Author: ASF GitHub Bot
Created on: 31/Jul/19 13:03
Start Date: 31/Jul/19 13:03
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on pull request #1196: HDDS-1881. Design doc: decommissioning in Ozone
URL: https://github.com/apache/hadoop/pull/1196#discussion_r309206899

##########
File path: hadoop-hdds/docs/content/design/decommissioning.md
##########

---
title: Decommissioning in Ozone
summary: Formal process to shut down machines in a safe way after the required replications.
date: 2019-07-31
jira: HDDS-1881
status: current
author: Anu Engineer, Marton Elek, Stephen O'Donnell
---

# Abstract

The goal of decommissioning is to turn off a selected set of machines without data loss. It may or may not require moving the existing replicas of the containers to other nodes.

There are two main classes of decommissioning:

 * __Maintenance mode__: the node is expected to come back after a while. It may not require replication of containers if enough replicas are available on other nodes (we expect the current replicas to be available again after the restart).

 * __Decommissioning__: the node won't be started again. All the data should be replicated according to the current replication rules.

Goals:

 * Decommissioning can be canceled at any time.
 * The progress of the decommissioning should be trackable.
 * Nodes under decommissioning / maintenance mode should not be used for new pipelines / containers.
 * The state of the datanodes should be persisted / replicated by the SCM (in HDFS the decommissioning exclude/include lists are replicated manually by the admin). If a datanode is marked for decommissioning, this state should be available after SCM and/or datanode restarts.
 * We need to support validations before decommissioning (but the violations can be ignored by the admin).
 * The administrator should be notified when a node can be turned off.
 * The maintenance mode can be time constrained: if a node is marked for maintenance for 1 week and the node is not up after one week, its containers should be considered lost (DEAD node) and should be replicated.

# Introduction

Ozone is a highly available file system that relies on commodity hardware. In other words, Ozone is designed to handle failures of these nodes all the time.

The Storage Container Manager (SCM) is designed to monitor node health and replicate blocks and containers as needed.

At times, operators of the cluster can help the SCM by giving it hints. When removing a datanode, the operator can provide a hint that a planned failure of the node is coming up, and SCM can make sure it reaches a safe state to handle this planned failure.

Sometimes this failure is transient; that is, the operator is taking down this node temporarily. In that case, we can live with lower replica counts by being optimistic.

Both of these operations, __Maintenance__ and __Decommissioning__, are similar from the replication point of view. In both cases, the user instructs us on how to handle an upcoming failure.

Today, SCM (the *Replication Manager* component inside SCM) understands only one form of failure handling. This paper extends the Replica Manager failure modes to allow users to request which failure handling model to adopt (optimistic or pessimistic).

Based on physical realities, there are two responses to any perceived failure: heal the system by taking corrective actions, or ignore the failure since future actions will heal the system automatically.

## User Experiences (Decommissioning vs Maintenance mode)

From the user's point of view, there are two kinds of planned failures that the user would like to communicate to Ozone.

The first kind is when a 'real' failure is going to happen in the future. This 'real' failure is the act of decommissioning.
We denote this as "decommission" throughout this paper. The response that the user wants is for SCM/Ozone to make replicas to deal with the planned failure.

The second kind is when the failure is 'transient.' The user knows that this failure is temporary and the cluster in most cases can safely ignore this issue. However, if the transient failures are going to cause a failure of availability, then the user would like Ozone to take appropriate actions to address it. An example of this case is if the user puts 3 datanodes into maintenance mode and switches them off.

The transient failure can violate the availability guarantees of Ozone, since the user is telling us not to take corrective actions. Many times, the user does not understand the impact on availability while asking Ozone to ignore the failure.

So this paper proposes the following definitions for Decommission and Maintenance of datanodes.

__Decommission__ of a datanode is deemed to be complete when SCM/Ozone completes the replication of all containers on the decommissioned datanode to other datanodes. That is, the expected count matches the healthy count of containers in the cluster.

__Maintenance mode__ of a datanode is complete if Ozone can guarantee that at least one copy of every container is available on other healthy datanodes.

## Examples

Here are some illustrative examples:

1. Let us say we have a container which has only one copy, residing on machine A. If the user wants to put machine A into maintenance mode, Ozone will make a replica before entering the maintenance mode.

2. Suppose a container has two copies, and the user wants to put machine A into maintenance mode. In this case, Ozone understands that availability of the container is not affected and hence can decide to forgo replication.

3. Suppose a container has two copies, and the user wants to put machine A into maintenance mode.
However, the user wants to put the machine into maintenance mode for one month. As the period of maintenance mode increases, the probability of data loss increases; hence, Ozone might choose to make a replica of the container even though we are only entering maintenance mode.

4. The semantics of decommissioning mean that as long as we can find copies of containers on other machines, we can technically get away with calling decommission complete. Hence this clarifying note: in the ordinary course of action, each decommission will create a replication flow for each container we have; however, it is possible to complete a decommission of a datanode even if the datanode being decommissioned fails. As long as we can find other datanodes to replicate from and get the number of replicas back up to the expected count, we are good.

5. Let us say we have a copy of a container replica on machines A, B, and C. It is possible to decommission all three machines at the same time, as decommissioning is just a status indicator of the datanode until we finish the decommissioning process.

The user-visible features for both of these are very similar:

Both Decommission and Maintenance mode can be canceled at any time before the operation is marked as completed by SCM.

Decommissioned nodes, if and when added back, shall be treated as new datanodes; if they have blocks or containers on them, they can be used to reconstruct data.

## Maintenance mode in HDFS

HDFS supports decommissioning and a maintenance mode similar to Ozone's. This is a quick description of the HDFS approach.

The usage of HDFS maintenance mode:

 * First, you set a minimum replica count on the cluster, which can be zero, but defaults to 1.
 * Then you can set a number of nodes into maintenance, with an expiry time, or have them remain in maintenance forever until they are manually removed.
Nodes are put into maintenance in much the same way as nodes are decommissioned.
 * When a set of nodes goes into maintenance, all blocks hosted on them are scanned, and if a node going into maintenance would cause the number of replicas to fall below the minimum replica count, the relevant nodes go into a decommissioning-like state while new replicas are made for the blocks.
 * Once a node goes into maintenance, it can be stopped etc. and HDFS will not be concerned about the under-replicated state of the blocks.
 * When the expiry time passes, the node is put back into the normal state (if it is online and heartbeating) or marked as dead, at which time new replicas will start to be made.

This is very similar to decommissioning, and the code to track maintenance mode and ensure the blocks are replicated etc. is effectively the same code as for decommissioning. The one area that differs is probably the replication monitor, as it must understand that the node is expected to be offline.

The ideal way to use maintenance mode is when you know there is a set of nodes you can stop without having to do any replications. In HDFS, rack awareness states that all blocks should be on two racks, so that means a rack can be put into maintenance safely.

There is another feature in HDFS called "upgrade domain" which allows each datanode to be assigned a group. By default there should be at least 3 groups (domains), and each of the 3 replicas will be stored on a different group, allowing one full group to be put into maintenance at once. That is not yet supported in CDH, but is something we are targeting for CDPD, I believe.

One other difference between maintenance mode and decommissioning is that you must have some sort of monitor thread checking for when maintenance is scheduled to end.
HDFS solves this by having a class called the DatanodeAdminManager, which tracks all nodes transitioning state, the under-replicated block count on them, etc.

# Implementation

## Datanode state machine

`NodeStateManager` maintains the state of the connected datanodes. The possible states:

 state             | description
 ------------------|------------
 HEALTHY           | The node is up and running.
 STALE             | Some heartbeats were missing from an already known node.
 DEAD              | The stale node has not recovered.
 ENTER_MAINTENANCE | The in-progress state: scheduling is disabled, but the node can't yet be turned off due to in-progress replication.
 IN_MAINTENANCE    | The node can be turned off, but we expect to get it back with all its replicas.
 DECOMMISSIONING   | The in-progress state: scheduling is disabled, and all the containers should be replicated to other nodes.
 DECOMMISSIONED    | The node can be turned off; all the containers are replicated to other machines.

## High level algorithm

The algorithm is pretty simple from the Decommission or Maintenance point of view:

 1. Mark a datanode as DECOMMISSIONING or ENTERING_MAINTENANCE. This implies that the node is NOT healthy anymore; we assume the use of a single flag and the law of the excluded middle.

 2. Pipelines should be shut down, and we wait for confirmation that all pipelines are shut down. So no new I/O or container creation can happen on a datanode that is part of decommissioning/maintenance.

 3. Once the node has been marked as DECOMMISSIONING or ENTERING_MAINTENANCE, the node will generate a list of containers that need replication. This list is generated by the Replica Count decisions for each container; the Replica Count will be computed by the Replica Manager.

 4. Once the Replica Count for these containers goes back to zero, which means that we have finished the pending replications, the containers are removed from this wait list.

 5.
Once the size of the wait list reaches zero, the maintenance mode or decommission is complete.

 6. We will update the node state to DECOMMISSIONED or IN_MAINTENANCE accordingly.

_Replica count_ is a calculated number which represents the number of _missing_ replicas. The number can be negative in case of an over-replicated container.

## Calculation of the _Replica count_ (required replicas)

### Counters / Variables

We have 7 different datanode states and three different types of container state (replicated, in-flight deletion, in-flight replication). To calculate the required replicas we introduce a few variables.

Note: we don't need to use all the possible counters, but the following table summarizes how the counters are calculated for the following algorithm.

For example, the `maintenance` variable counts the existing replicas on ENTERING_MAINTENANCE or IN_MAINTENANCE nodes.

Each counter should be calculated on a per-container basis.

 Node state                            | Containers (minus in-flight deletions) | In-flight replications
 --------------------------------------|----------------------------------------|------------------------
 HEALTHY                               | `healthy`                              | `inFlight`
 STALE + DEAD + DECOMMISSIONED         |                                        |
 DECOMMISSIONING                       |                                        |
 ENTERING_MAINTENANCE + IN_MAINTENANCE | `maintenance`                          |

### The current replication model

The current replication model in SCM/Ozone is very simplistic. We compute the replication count, or the number of replications that we need to do, as:

```
Replica count = expectedCount - currentCount
```

In case the _Replica count_ is positive, it means that we need to make more replicas. If the number is negative, it means that we are over-replicated and we need to remove some replicas of this container. If the Replica count for a container is zero, it means that we have the expected number of containers in the cluster.
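To make the counter table above concrete, here is a minimal sketch of how such per-container counters might be derived from a replica list. The class, record, and method names are hypothetical illustrations, not the actual SCM code:

```java
import java.util.List;

public class ReplicaCounters {

  // Simplified node states, mirroring the table in this section
  enum NodeState {
    HEALTHY, STALE, DEAD, DECOMMISSIONED, DECOMMISSIONING,
    ENTERING_MAINTENANCE, IN_MAINTENANCE
  }

  // One replica of a container: where it lives, and whether a delete is pending
  record Replica(NodeState nodeState, boolean inFlightDeletion) {}

  // Replicas on HEALTHY nodes, excluding in-flight deletions
  static int countHealthy(List<Replica> replicas) {
    return (int) replicas.stream()
        .filter(r -> r.nodeState() == NodeState.HEALTHY && !r.inFlightDeletion())
        .count();
  }

  // Replicas on ENTERING_MAINTENANCE or IN_MAINTENANCE nodes
  static int countMaintenance(List<Replica> replicas) {
    return (int) replicas.stream()
        .filter(r -> (r.nodeState() == NodeState.ENTERING_MAINTENANCE
                   || r.nodeState() == NodeState.IN_MAINTENANCE)
                  && !r.inFlightDeletion())
        .count();
  }

  public static void main(String[] args) {
    List<Replica> replicas = List.of(
        new Replica(NodeState.HEALTHY, false),
        new Replica(NodeState.HEALTHY, true),   // being deleted: not counted
        new Replica(NodeState.IN_MAINTENANCE, false));
    System.out.println(countHealthy(replicas));     // 1
    System.out.println(countMaintenance(replicas)); // 1
  }
}
```

Replicas on STALE, DEAD, DECOMMISSIONED, or DECOMMISSIONING nodes deliberately fall into neither counter, matching the empty cells of the table.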
To support idempotent placement strategies we include the in-flight replications in the `currentCount`: if there is one in-flight replication process and two replicas, we won't start a new replication command unless the original command has timed out.

The timeout is configured with `hdds.scm.replication.event.timeout`; the default value is 10 minutes.

More precisely, the current algorithm is the following:

```
Replica count = expectedCount - healthy - inFlight
```

### The proposed solution

To support the notion that a user can provide hints to the replication model, we propose to add two variables to the current model.

In the new model, we propose to break the `currentCount` into three separate groups: _healthy nodes_, _maintenance nodes_, and _decommission nodes_. The new model replaces the currentCount with these three separate counts. The following function captures the code that drives the logic of computing Replica counts in the new model. The examples below discuss the input and output of this model very extensively.

```java
/**
 * Calculate the number of missing replicas.
 *
 * @return the number of missing replicas. If it's less than zero,
 *         the container is over-replicated.
 */
int getReplicationCount(int expectedCount, int healthy,
    int maintenance, int inFlight) {

  // for over-replication, count only the healthy replicas
  if (expectedCount < healthy) {
    return expectedCount - healthy;
  }

  int replicaCount = expectedCount - (healthy + maintenance + inFlight);

  // schedule one extra replication if all the remaining
  // replicas are on maintenance nodes
  if (replicaCount == 0 && healthy < 1) {
    replicaCount++;
  }

  // over-replication is already handled above
  return Math.max(0, replicaCount);
}
```

We also need to specify two end conditions: when a DECOMMISSIONING node can be moved to the DECOMMISSIONED state, and when an ENTERING_MAINTENANCE node can be moved to the IN_MAINTENANCE state.
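As an illustrative sketch (hypothetical names, not the actual SCM code), these end conditions might be expressed as predicates over the same per-container counters used by `getReplicationCount`:

```java
public class EndConditions {

  /**
   * A container permits the DECOMMISSIONING -> DECOMMISSIONED transition
   * when at least one replica is on a HEALTHY node and at most one
   * replica is missing (counting maintenance-node replicas as present).
   */
  static boolean canDecommission(int expectedCount, int healthy, int maintenance) {
    boolean oneLiveReplica = healthy >= 1;
    boolean atMostOneMissing = expectedCount - (healthy + maintenance) <= 1;
    return oneLiveReplica && atMostOneMissing;
  }

  /**
   * The ENTERING_MAINTENANCE -> IN_MAINTENANCE transition only needs
   * at least one replica on a HEALTHY node.
   */
  static boolean canEnterMaintenance(int healthy) {
    return healthy >= 1;
  }

  public static void main(String[] args) {
    // 2 healthy replicas remain elsewhere: decommission may complete
    System.out.println(canDecommission(3, 2, 0)); // true
    // only 1 healthy replica left: must replicate first
    System.out.println(canDecommission(3, 1, 0)); // false
  }
}
```

A node would transition only once every (closed) container on it satisfies the relevant predicate; the concrete thresholds are spelled out in the conditions listed next.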
The following conditions should be true for all the containers, and all the containers on the specific node should be closed.

From DECOMMISSIONING to DECOMMISSIONED:

 * There is at least one healthy replica
 * There is at most one missing replica

Which means that a node can be decommissioned if:

 * all the containers with replication factor THREE have at least *one replica* on a HEALTHY node (minimum.live.replicas)
 * all the containers with replication factor THREE have at least *three* replicas on HEALTHY/ENTERING_MAINTENANCE/IN_MAINTENANCE nodes (minimum.replicas)
 * all the containers with replication factor ONE have one replica on a HEALTHY node.

From ENTERING_MAINTENANCE to IN_MAINTENANCE:

 * There is at least one healthy replica

Which means that a node can be moved to maintenance if:

 * all the containers with replication factor THREE have at least *one replica* on a HEALTHY node (minimum.live.replicas)
 * all the containers with replication factor ONE have one replica on a HEALTHY node.

Note: the specified numbers can be made cluster-wide configurable.

### Examples (normal cases)

First, let's talk about the simple cases where there is no over-replication or in-flight replica copy. In these cases the previous model and the new model give the same results.

#### All healthy

 Node with replica | Node status
 ------------------|------------
 A                 | HEALTHY
 B                 | HEALTHY
 C                 | HEALTHY

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 3
 maintenance        | 0
 replicaCount       | 0

The container C1 exists on machines A, B, and C. All the container reports tell us that the container is healthy. Running the above algorithm, we get:

`ReplicaCount = expectedCount - (healthy + maintenance) = 3 - (3 + 0) = 0`

It means, _"we don't need no replication."_

#### One failure

 Node with replica | Node status
 ------------------|------------
 A                 | HEALTHY
 B                 | HEALTHY
 C                 | DEAD

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 2
 maintenance        | 0
 replicaCount       | 1

Machine C has failed, and as a result the healthy count has gone down from `3` to `2`. This means that we need to start one replication flow:

`ReplicaCount = expectedCount - (healthy + maintenance) = 3 - (2 + 0) = 1`

This means that the new model will handle failure cases just like the current model.

#### One decommissioning

 Node with replica | Node status
 ------------------|------------
 A                 | HEALTHY
 B                 | HEALTHY
 C                 | DECOMMISSIONING

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 2
 maintenance        | 0
 replicaCount       | 1

In this case, machine C is being decommissioned. Therefore the healthy count has gone down to `2`, and the decommission count is `1`. Since `ReplicaCount = expectedCount - (healthy + maintenance)`, we have `1 = 3 - (2 + 0)`, which gives us the decommission count implicitly. The trick here is to realize that incrementing decommission automatically causes a decrement in the healthy count, which allows us not to have _decommission_ in the equation explicitly.

**Stop condition**: Note that if this container is the only one on node C, node C can be moved to the DECOMMISSIONED state.

#### Failure + decommissioning

 Node with replica | Node status
 ------------------|------------
 A                 | HEALTHY
 B                 | DEAD
 C                 | DECOMMISSIONING

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 1
 maintenance        | 0
 replicaCount       | 2

Here is a case where we have a failure of one datanode and a decommission of another datanode. In this case, the container C1 needs two replica flows to heal itself.
The equation is the same, and we get:

`ReplicaCount(2) = expectedCount(3) - healthy(1)`

The maintenance count is still zero, so it is ignored in this equation.

#### 1 failure + 2 decommissioning

 Node with replica | Node status
 ------------------|------------
 A                 | DEAD
 B                 | DECOMMISSIONING
 C                 | DECOMMISSIONING

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 0
 maintenance        | 0
 replicaCount       | 3

In this case, we have one failed datanode and two datanodes being decommissioned. We need to get three replica flows in the system. This is achieved by:

```
ReplicaCount(3) = expectedCount(3) - (healthy(0) + maintenance(0))
```

#### Maintenance mode

 Node with replica | Node status
 ------------------|------------
 A                 | HEALTHY
 B                 | HEALTHY
 C                 | ENTERING_MAINTENANCE

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 2
 maintenance        | 1
 replicaCount       | 0

This represents the normal maintenance mode, where a single machine is marked as in maintenance mode. This means the following:

```
ReplicaCount(0) = expectedCount(3) - (healthy(2) + maintenance(1))
```

There are no replica flows, since the user has asked us to move a single node into maintenance mode and asked us explicitly not to worry about the single missing node.

**Stop condition**: Note that if this container is the only one on node C, node C can be moved to the IN_MAINTENANCE state.

#### Maintenance + decommissioning

 Node with replica | Node status
 ------------------|------------
 A                 | HEALTHY
 B                 | DECOMMISSIONING
 C                 | ENTERING_MAINTENANCE

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 1
 maintenance        | 1
 replicaCount       | 1

*This is a fascinating case*: we have one good node, one node being decommissioned, and one node in maintenance mode.
The expected result is that the replica manager will launch one replication flow to compensate for the node that is being decommissioned, and we also expect that there will be no replication for the node in maintenance mode.

```
ReplicaCount(1) = expectedCount(3) - (healthy(1) + maintenance(1))
```

So, as expected, we have one replication flow in the system.

**Stop condition**: Note that if this container is the only one in the system:

 * node C can be moved to the IN_MAINTENANCE state
 * node B can not be decommissioned (we need the three replicas first)

#### Decommissioning all the replicas

 Node with replica | Node status
 ------------------|------------
 A                 | DECOMMISSIONING
 B                 | DECOMMISSIONING
 C                 | DECOMMISSIONING

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 0
 maintenance        | 0
 replicaCount       | 3

In this case, all the datanodes holding this container are being decommissioned. The number of healthy replicas for this container is 0, and hence:

```
ReplicaCount(3) = expectedCount(3) - (healthy(0) + maintenance(0))
```

This provides us with 3 independent replica flows in the system.

#### Decommissioning the one remaining replica

 Node with replica | Node status
 ------------------|------------
 A                 | DEAD
 B                 | DEAD
 C                 | DECOMMISSIONING

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 0
 maintenance        | 0
 replicaCount       | 3

We have two failed nodes and one node being decommissioned. It is the opposite of the "1 failure + 2 decommissioning" case above, where we have one failed node and two nodes being decommissioned. The expected results are the same: we get 3 flows.
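The worked examples above can be cross-checked by transcribing the proposed `getReplicationCount` logic into a small self-contained class (class and variable names are assumed for illustration):

```java
public class ReplicaCountExamples {

  // Direct transcription of the proposed getReplicationCount logic
  static int getReplicationCount(int expectedCount, int healthy,
      int maintenance, int inFlight) {
    if (expectedCount < healthy) {
      return expectedCount - healthy;       // over-replicated: negative result
    }
    int replicaCount = expectedCount - (healthy + maintenance + inFlight);
    if (replicaCount == 0 && healthy < 1) {
      replicaCount++;                       // keep at least one healthy replica
    }
    return Math.max(0, replicaCount);
  }

  public static void main(String[] args) {
    // All healthy: 3 - (3 + 0 + 0) = 0 flows
    System.out.println(getReplicationCount(3, 3, 0, 0)); // 0
    // One failure: 3 - (2 + 0 + 0) = 1 flow
    System.out.println(getReplicationCount(3, 2, 0, 0)); // 1
    // Maintenance mode: 3 - (2 + 1 + 0) = 0 flows
    System.out.println(getReplicationCount(3, 2, 1, 0)); // 0
    // Maintenance + decommissioning: 3 - (1 + 1 + 0) = 1 flow
    System.out.println(getReplicationCount(3, 1, 1, 0)); // 1
    // All replicas decommissioning or dead: 3 flows
    System.out.println(getReplicationCount(3, 0, 0, 0)); // 3
  }
}
```

Each call reproduces the `replicaCount` row of the corresponding example table.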
#### Total failure

 Node with replica | Node status
 ------------------|------------
 A                 | DEAD
 B                 | DEAD
 C                 | DEAD

 Counter            | Value
 ------------------ | -------------
 expectedCount      | 3
 healthy            | 0
 maintenance        | 0
 replicaCount       | 3

----------

Review comment: whitespace: end of line

Issue Time Tracking
-------------------

Worklog Id: (was: 285768)
Time Spent: 5h  (was: 4h 50m)

> Design doc: decommissioning in Ozone
> ------------------------------------
>
>                 Key: HDDS-1881
>                 URL: https://issues.apache.org/jira/browse/HDDS-1881
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>              Labels: design, pull-request-available
>          Time Spent: 5h
>  Remaining Estimate: 0h
>
> Design doc can be attached to the documentation. In this jira the design doc will be attached and merged to the documentation page.