hadoop-yarn-issues mailing list archives

From "Daniel Zhi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4676) Automatic and Asynchronous Decommissioning Nodes Status Tracking
Date Sun, 31 Jul 2016 04:09:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400947#comment-15400947 ]

Daniel Zhi commented on YARN-4676:

Transcript of recent design discussions over email:

From: Junping Du [mailto:jdu@hortonworks.com] 
Sent: Friday, July 29, 2016 10:40 AM
To: Robert Kanter; Zhi, Daniel
Cc: Karthik Kambatla; Ming Ma
Subject: Re: YARN-4676 discussion

The plan sounds reasonable to me. Thanks, Robert, for coordinating on this.
I just committed YARN-5434 to branch-2.8. So Daniel, please start your rebase work at your earliest
convenience - I will try the functionality and review the code as well. About the previous questions,
it is better to have such discussions (with technical details) on the JIRA so we won't lose
track in the future, and it helps raise your work's visibility, as a wider audience will sooner
or later be watching this.

From: Robert Kanter <rkanter@cloudera.com>
Sent: Friday, July 29, 2016 2:20 AM
To: Zhi, Daniel
Cc: Karthik Kambatla; Junping Du; Ming Ma
Subject: Re: YARN-4676 discussion 
Sorry, I used milliseconds instead of seconds, but it sounds like my example was clear enough
despite that.  

In the interest of moving things along, let's try to get YARN-4676 committed as soon as possible,
and then take care of the remaining items as followup JIRAs.  It sounds like some of these
require more discussion, and it might be easier this way instead of holding up YARN-4676 until
everything is perfect.  So let's do this:
1.	Junping will commit YARN-5434 to add the -client|server arguments tomorrow (Friday); for
2.	Daniel will rebase YARN-4676 on top of the latest, and make any minor changes due to YARN-5434; for 2.9+
	o	Junping and I can do a final review and hopefully commit early next week
3.	We'll open followup JIRAs for the following items where we can discuss/implement more:
	a.	Abstract out the host file format to allow the txt format, XML format, and JSON format
	b.	Figure out if we want to change the behavior of subsequent parallel calls to gracefully decom nodes
Does that sound like a good plan?  
Did I miss anything?

On Tue, Jul 26, 2016 at 10:50 PM, Zhi, Daniel <danzhi@amazon.com> wrote:
Karthik: the timeout is currently always in seconds, both the command line arg and
the internal variables. Robert’s example should be “-refreshNodes -t 30”.
Robert: the timeout overwrite behavior you observed matches the code logic inside NodesListManager.java.
The first “-refreshNodes -t 120” sets 120 on A. The second “-refreshNodes -t 30”
sets 30 on both A and B. In this case, there is no timeout in the exclude file, so timeoutToUse
is the request timeout (120, then 30), and A’s timeout is updated to 30 by lines 287 + 311. To some
extent, this is by design, as both refreshes refer to the set of hosts inside the exclude host file.
Specifically, the second refresh is about both A and B instead of B only, even though your intention
was about B. In your case, if the timeout had been specified inside the exclude host file, it would
have worked as you expected.
Although the code could be changed to not update an existing node's timeout unless that timeout
comes from the host file (per-node override), I couldn’t tell quickly whether that is better,
given possible other side effects. This is something we can evaluate more.
252   private void handleExcludeNodeList(boolean graceful, Integer timeout) {
276           // Use per node timeout if exist otherwise the request timeout.
277           Integer timeoutToUse = (timeouts.get(n.getHostName()) != null)?
278               timeouts.get(n.getHostName()) : timeout;
283           } else if (s == NodeState.DECOMMISSIONING &&
284                      !Objects.equals(n.getDecommissioningTimeout(),
285                          timeoutToUse)) {
286             LOG.info("Update " + nodeStr + " timeout to be " + timeoutToUse);
287             nodesToDecom.add(n);
311         e = new RMNodeDecommissioningEvent(n.getNodeID(), timeoutToUse);
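
A minimal, self-contained sketch of the overwrite behavior described above, assuming a simplified in-memory model. The class `TimeoutOverwriteSketch`, its `refresh` method, and the maps are hypothetical names for illustration and are not the actual NodesListManager code. It only mirrors the quoted rule "use the per-node timeout if it exists, otherwise the request timeout" and re-applies it to every host in the exclude list on each refresh.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical model of the timeout selection discussed above; not the real NodesListManager.
class TimeoutOverwriteSketch {
  // Timeout (seconds) currently recorded for each host that is being decommissioned.
  final Map<String, Integer> decommissioning = new LinkedHashMap<>();

  // One refresh: every host in the exclude list is evaluated again.
  // perNodeTimeouts holds timeouts written inside the exclude file (may be empty).
  void refresh(List<String> excludedHosts, Map<String, Integer> perNodeTimeouts,
               Integer requestTimeout) {
    for (String host : excludedHosts) {
      Integer perNode = perNodeTimeouts.get(host);
      // Use the per-node timeout if it exists, otherwise the request timeout.
      Integer timeoutToUse = (perNode != null) ? perNode : requestTimeout;
      if (!decommissioning.containsKey(host)
          || !Objects.equals(decommissioning.get(host), timeoutToUse)) {
        // New host, or an existing host whose timeout differs: record the new value.
        decommissioning.put(host, timeoutToUse);
      }
    }
  }

  public static void main(String[] args) {
    TimeoutOverwriteSketch rm = new TimeoutOverwriteSketch();
    rm.refresh(List.of("A"), Map.of(), 120);      // first refresh: only A is excluded
    rm.refresh(List.of("A", "B"), Map.of(), 30);  // second refresh: A and B are excluded
    System.out.println(rm.decommissioning);       // prints {A=30, B=30}
  }
}
```

If the exclude file carried a per-node timeout for A (for example `Map.of("A", 120)` in both calls), A would keep 120 while B would get 30, which is the "works as you expected" case mentioned above.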
From: Karthik Kambatla [mailto:kasha@cloudera.com] 
Sent: Tuesday, July 26, 2016 6:08 PM
To: Robert Kanter; Zhi, Daniel
Cc: Junping Du; Ming Ma
Subject: Re: YARN-4676 discussion
Related, but orthogonal. 

Do we see value in using milliseconds for the timeout? Should it be seconds?
On Tue, Jul 26, 2016 at 5:55 PM Robert Kanter <rkanter@cloudera.com> wrote:
I spoke with Junping and Karthik earlier today.  We agreed that to simplify support for client-side
tracking and server-side tracking, we should add a required -client|server argument.  We'll
add this in 2.8 to be compatible with the server-side tracking, when it's ready (probably
2.9); in the meantime, the -server flag will throw an Exception.  This will help simplify
things a little in that client-side tracking and server-side tracking are now mutually exclusive.
For example
   yarn rmadmin -refreshNodes -g 1000 -client
will do client-side tracking, while
   yarn rmadmin -refreshNodes -g 1000 -server
will do server-side tracking (once committed).
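
As a rough illustration of a required, mutually exclusive -client|server argument, here is a small standalone parser sketch. The class `TrackingModeArgsSketch` and its types are hypothetical, and this is not the RMAdminCLI implementation; it only assumes the argument shape used in the two example commands above, with an optional numeric timeout after -g.

```java
// Hypothetical parser for "-refreshNodes -g [timeout] -client|-server"; not the real RMAdminCLI.
class TrackingModeArgsSketch {
  enum TrackingMode { CLIENT, SERVER }

  static final class GracefulRequest {
    final Integer timeout;    // null when no timeout was given on the command line
    final TrackingMode mode;

    GracefulRequest(Integer timeout, TrackingMode mode) {
      this.timeout = timeout;
      this.mode = mode;
    }
  }

  static GracefulRequest parse(String[] args) {
    if (args.length < 3 || !"-refreshNodes".equals(args[0]) || !"-g".equals(args[1])) {
      throw new IllegalArgumentException("expected: -refreshNodes -g [timeout] -client|-server");
    }
    int i = 2;
    Integer timeout = null;
    if (!args[i].startsWith("-")) {
      // Anything that is not a flag is treated as the optional timeout value.
      timeout = Integer.valueOf(args[i]);
      i++;
    }
    // Exactly one argument must remain, and it must be the tracking mode.
    if (i != args.length - 1) {
      throw new IllegalArgumentException("exactly one of -client or -server is required");
    }
    switch (args[i]) {
      case "-client":
        return new GracefulRequest(timeout, TrackingMode.CLIENT);
      case "-server":
        return new GracefulRequest(timeout, TrackingMode.SERVER);
      default:
        throw new IllegalArgumentException("unknown tracking mode: " + args[i]);
    }
  }

  public static void main(String[] args) {
    GracefulRequest r = parse(new String[] {"-refreshNodes", "-g", "1000", "-client"});
    System.out.println(r.mode + " timeout=" + r.timeout);
  }
}
```

Making the mode a single required trailing argument means a command can never ask for both kinds of tracking, or for neither, which is the simplification the paragraph above is after.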
I've created YARN-5434 to implement the -client|server argument (and fix the timeout argument
not actually being optional).  @Junping, please review it.
Once we get YARN-5434 committed, @Daniel, can you update YARN-4676 to support this?  I think
that should only require minimal changes to the CLI, and there was that 5 second overhead
we probably don't need anymore.  
Long term, I agree that being able to specify the nodes on the CLI would be easier.  I think
that's something we should revisit later, once we get the current work done.  
One other thing I noticed, and just double-checked, is that a subsequent graceful
decom command causes the previous nodes to update their timeout.  An example, to clarify:
Setup: nodeA and nodeB are in the include file
1. Start a long-running job that has containers running on nodeA
2. Add nodeA to the exclude file
3. Run '-refreshNodes -g 120000' (2min) to begin gracefully decommissioning nodeA.  The CLI
blocks, waiting.
4. Wait 30 seconds.  Above CLI command should have 90 seconds left
5. Add nodeB to the exclude file
6. Run '-refreshNodes -g 30000' (30sec) in another window
7. Wait 29 seconds.  Original CLI command should have 61 seconds left, the new command should
have 1 second
8. After 1 more second, both nodeA and nodeB shut down and both CLI commands exit
It would be better if the second command had no bearing on the first command.  So in the above
example, only nodeB should have shut down at the 8th step, while nodeA should have shut down
60 seconds later.  Otherwise, you can't run concurrent graceful decoms.  If the user wants to
change the timeout of an existing decom, they can recommission it, and then gracefully decommission
it again with the new timeout.
Does that make sense?  Did I do something wrong here?  
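
The difference between the observed and the preferred behavior can be sketched as a small deadline calculation. This is an illustrative model only: `DecomDeadlineSketch` and `Command` are hypothetical names, timeouts are taken in seconds, and it assumes that a refresh which overwrites a node's timeout also restarts that node's clock, which is what the example above describes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical timeline model for overlapping graceful decommission commands.
class DecomDeadlineSketch {
  static final class Command {
    final int issuedAt;           // seconds since the first command
    final int timeout;            // seconds
    final List<String> excluded;  // content of the exclude file when the command runs

    Command(int issuedAt, int timeout, List<String> excluded) {
      this.issuedAt = issuedAt;
      this.timeout = timeout;
      this.excluded = excluded;
    }
  }

  // Returns host -> absolute shutdown deadline in seconds.
  // overwriteExisting=true models the observed behavior; false models independent decoms.
  static Map<String, Integer> deadlines(List<Command> commands, boolean overwriteExisting) {
    Map<String, Integer> result = new LinkedHashMap<>();
    for (Command c : commands) {
      for (String host : c.excluded) {
        if (overwriteExisting || !result.containsKey(host)) {
          result.put(host, c.issuedAt + c.timeout);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Command> commands = List.of(
        new Command(0, 120, List.of("nodeA")),
        new Command(30, 30, List.of("nodeA", "nodeB")));
    System.out.println("observed:  " + deadlines(commands, true));
    System.out.println("preferred: " + deadlines(commands, false));
  }
}
```

With overwriting, both nodes end up with a deadline 60 seconds after the first command; without it, nodeA keeps its original 120 second deadline and only nodeB is due at 60 seconds.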
On Tue, Jul 26, 2016 at 9:55 AM, Zhi, Daniel <danzhi@amazon.com> wrote:
DecommissioningNodesWatcher only tracks nodes in DECOMMISSIONING state during the decommissioning
period. It is supposed to be near zero cost for all other nodes and times.
Upon RM restart, NodesListManager will decommission all hosts that appear in the exclude host
file. That behavior remains the same with or without YARN-4676. So if the RM restarts in the middle
of “client-side tracking”, the nodes will become decommissioned right away, instead of
waiting for the client to decommission them (the RM does not even know there is such a client). So
overall, supporting the restart scenario has more to do with persisting and recovering node states
than with client-side vs. server-side tracking.
So I don’t see strong arguments for “client-side tracking” from the performance impact
and restart perspectives. That said, my real concern is how to support both on the RM side so
that the code is not too complicated, or at least still remains compatible. For example: the
following code inside RMNodeImpl, although it appears to be under construction, is likely how a node becomes
decommissioned under “client-side tracking”. It acts on the StatusUpdate event, which
is different from DecommissioningNodesWatcher acting on the NodeManager heartbeat messages.
I don’t fully understand the StatusUpdate event mechanism but found it was conflicting with
DecommissioningNodesWatcher. Does this code need to be restored within an if(clientSideTracking)
block? There could be other state transition or event related code that affects state transitions,
counters, resources etc. that needs to be compatible and verified. Things could get tricky and
complicated, adding to the complexity of supporting both.
  public static class StatusUpdateWhenHealthyTransition implements
      ...
      if (isNodeDecommissioning) {
        List<ApplicationId> runningApps = rmNode.getRunningApps();
        List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
        // no running (and keeping alive) app on this node, get it
        // decommissioned.
        // TODO may need to check no container is being scheduled on this node
        // as well.
        if ((runningApps == null || runningApps.size() == 0)
            && (keepAliveApps == null || keepAliveApps.size() == 0)) {
          RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
          return NodeState.DECOMMISSIONED;
        }
        // TODO (in YARN-3223) if node in decommissioning, get node resource
        // updated if container get finished (keep available resource to be 0)
      }
From: Junping Du [mailto:jdu@hortonworks.com] 
Sent: Tuesday, July 26, 2016 7:46 AM
To: Zhi, Daniel; Ming Ma; Robert Kanter
Cc: Karthik Kambatla
Subject: Re: YARN-4676 discussion
In my understanding, the values of client-side tracking are:
- We may not want the RM to track things that are not strictly necessary in a very large scale
YARN cluster where the RM's resources are already the bottleneck. Although I don't think DecommissioningNodesWatcher
would consume too many resources, I would prefer to play it safe and give users a chance
to take this pressure off the RM.
- With a client-side tracking tool, users won't expect the timeout tracking to be rock solid
in every case, which means they must have a backup plan if timeout tracking is critically important,
i.e. it can also be tracked by cluster management software such as Ambari or Cloudera Manager.
That's why I would like to emphasize that server-side tracking needs to address the RM restart case:
users' expectations differ with the different price we pay.
So I would prefer that we have both ways (client-side tracking and server-side tracking)
for YARN users to choose between for different scenarios. Do we have a different opinion here?
I agree with Ming's point that specifying the host list on the command line could be simpler and
more convenient in some cases, and I also agree it could be the next step, as we need to figure out
a way to persist these ad-hoc nodes for decommission/recommission. Ming, would you like to
open a JIRA for this discussion? Thanks!
From: Zhi, Daniel <danzhi@amazon.com>
Sent: Tuesday, July 26, 2016 5:20 AM
To: Ming Ma; Robert Kanter
Cc: Karthik Kambatla; Junping Du; Zhi, Daniel
Subject: RE: YARN-4676 discussion 
The “server-side tracking” (client does not wait) is friendly and necessary for cluster control
software that decommissions/recommissions hosts automatically. The “client-side tracking”
is friendly for an admin who manually issues the decommission command and waits for completion.

The issue that “-refreshNodes -g timeout” is always blocking could be addressed by an
additional “-wait” or “-nowait” flag, depending on which default behavior is chosen. Whether
that is sufficient depends on the purpose of “client-side tracking”: if the core
purpose is to have a blocking experience on the client side and ensure forceful decommission after
the timeout as necessary, then the code is already doing it. If there is some other unspecified server-side
value that I am not aware of, then more changes are needed to bridge the gap (for example:
disengage DecommissioningNodesWatcher inside RM, as it will decommission the node when ready).
Overall, can anyone clarify the exact purpose and scope of “client-side tracking”?
For the scenario mentioned by Ming, where an admin prefers to directly supply hosts in the refresh
command for convenience, I see the value in supporting it; however, the server side logic needs to
merge or consolidate such hosts with the ones specified in the exclude/include host file. For example,
another -refreshNodes could be invoked the traditional way with a conflicting setting for the
host inside the exclude/include file. Besides, I am not sure whether a separate command -decommissionNodes
would be better. Overall, it is an alternative convenient flavor to support, not necessarily
core to “client-side tracking” vs “server-side tracking”.
From: Ming Ma [mailto:mingma@twitter.com] 
Sent: Tuesday, July 19, 2016 5:47 PM
To: Robert Kanter
Cc: Zhi, Daniel; Karthik Kambatla; Junping Du
Subject: Re: YARN-4676 discussion
I just went over the discussion. It sort of reminded me of discussions we had when trying
to add timeout support to HDFS datanode maintenance mode, specifically https://issues.apache.org/jira/browse/HDFS-7877?focusedCommentId=14739108&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14739108.
We considered a couple of approaches.
•	Have admin tools outside hadoop core manage the timeout. There is no hadoop API to support
timeout. The external tools need to track state such as expiration and call hadoop to
change the node to the final state upon timeout.
•	Have the hadoop service manage the timeout without persistence. Admin issues refreshNodes
with a timeout value and the namenode keeps track of the expiration but will lose the data upon
•	Have the hadoop service manage the timeout with persistence. It means the timeout or expiration
timestamp needs to be persisted somewhere.
Unless something else comes up, we plan to go with the last approach to support HDFS maintenance
state. Admins have to use host files to configure the expiration timestamp. (We discussed the
idea of allowing admins to specify host name and timeout as part of dfsadmin -refreshNodes,
but that would require some major changes in how HDFS persists datanode properties.)
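
To make the "with persistence" approach concrete, here is a minimal sketch that stores an absolute expiration timestamp per host and reads it back, as a restarted service would. Everything here is hypothetical: `PersistedExpirySketch` is an illustrative name, and the `java.util.Properties` text form is used only as a stand-in for whatever host file format is actually chosen.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: persist host -> expiration timestamp so a restart does not lose it.
class PersistedExpirySketch {
  // Serialize the expirations (host -> epoch millis as text) to a string.
  static String save(Properties expirations) {
    try {
      StringWriter out = new StringWriter();
      expirations.store(out, null);
      return out.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Rebuild the expirations from the persisted text, e.g. after a restart.
  static Properties load(String persisted) {
    try {
      Properties p = new Properties();
      p.load(new StringReader(persisted));
      return p;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // A host is expired once the current time reaches its stored timestamp.
  static boolean expired(Properties expirations, String host, long nowMillis) {
    String value = expirations.getProperty(host);
    return value != null && nowMillis >= Long.parseLong(value);
  }

  public static void main(String[] args) {
    Properties before = new Properties();
    before.setProperty("hostFoo", Long.toString(System.currentTimeMillis() + 200_000L));
    Properties after = load(save(before)); // simulate a restart
    System.out.println(expired(after, "hostFoo", System.currentTimeMillis()));
  }
}
```

Storing an absolute expiration time rather than a remaining duration is what lets the value stay meaningful across a restart, since no in-memory clock has to be recovered.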
For yarn graceful decommission, it is similar in terms of the approaches we can take.
From the admin API's point of view, there are two approaches we can take:
•	Configure everything in the hosts file, including whether it is regular decommission or
graceful decommission at the individual host level. Thus the client no longer needs to send the graceful
decommission flag to the RM over RPC.
•	Configure everything in the refreshNode command, such as "rmadmin -refreshNode -g -t 200 -host
hostFoo". Our admins like this. This allows automation tools to decommission nodes from any
machine. However, this introduces new complexity in terms of how to persist node properties.
This might be the long term solution for later.
Maybe we can also get input from Yahoo and Microsoft folks in terms of the admin usage by
continuing the discussion in the jira?
Thanks all. Overall this feature is quite useful and we look forward to it.
On Fri, Jul 15, 2016 at 4:22 PM, Robert Kanter <rkanter@cloudera.com> wrote:
+Karthik, who I've also been discussing this with internally.
On Thu, Jul 14, 2016 at 6:41 PM, Robert Kanter <rkanter@cloudera.com> wrote:
Pushing the JSON format to a followup JIRA sounds like a good idea to me.  I agree that we
should focus on keeping the scope limited enough for YARN-4676 so we can move forward.  
Besides having a way to disable/enable server-side tracking, I think it would make sense to
have a way to disable/enable client-side tracking.  Daniel goes through the different scenarios,
and there's a gap here.  What if the user wants to set a non-default timeout, but use server-side
tracking only?  If they specify the timeout on the commandline, they'll have to do ctrl-c,
which isn't the greatest user experience and not script-friendly.  
Perhaps we should have it so that client-side tracking is normally done, but if you specify
a new optional argument (e.g. "--no-client-tracking" or something) it won't.  This way, we
can support client-side only tracking and server-side only tracking (I suppose this would
allow no tracking, but that wouldn't work right).  
Alternatively, we could make client-side and server-side tracking mutually exclusive, which
might simplify things.  The default is client-side tracking only, but if you specify a new
optional argument (e.g. "--server" or something), it will do server-side tracking only.  
I'm kind of leaning towards making them mutually exclusive because the behavior is easier
to understand and more well defined.
- Robert
On Thu, Jul 14, 2016 at 2:49 PM, Zhi, Daniel <danzhi@amazon.com> wrote:
(added Robert to the thread and updated subject)
Regarding client-side tracking, let me describe my observation of the behavior to ensure we
are all on the same page. I have run the “-refreshNodes -g [timeout in seconds]” command
before, during the patch iterations, inside a real cluster (except for the last patch on 06/12/2016,
where I couldn’t build the AMI as normal due to a Java 8 related conflict with another component
in our AMI rolling process).
I assume the interesting part for this conversation is when a timeout value is specified. It
will hit the refreshNodes(int timeout) function in RMAdminCLI.java which I pasted below for
convenience. The logic submits a RefreshNodesRequest with the user timeout, which will
be interpreted by the server side code as the default timeout (overriding the one in the yarn-site.xml
file). For an individual node with an explicit timeout specified in the exclude host file, that specific
timeout will be used.
The server-side tracking supports (is compatible with) multiple overlapping refreshNodes requests.
For example, at time T1, the user might want to gracefully decommission 10 nodes; while they are still
pending in decommissioning at time T2, the user changes their mind and decommissions 5 nodes instead (5
nodes will be removed from the exclude file and the RM will bring them back to running); later, at time
T3, the user is free to gracefully decommission more or fewer nodes as they wish. The enhanced logic
in NodesListManager.java compares the user intention with the current state of the RMNode and issues
the proper actions (including dynamically modifying the timeout) through events.
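
The comparison of user intention against current node state can be pictured as a small decision function. This is a simplified sketch under stated assumptions: `RefreshReconcileSketch`, its `State` and `Action` enums, and the `decide` method are hypothetical and much coarser than the real RMNode states and events.

```java
import java.util.Objects;

// Hypothetical reconciliation of "is the host excluded?" against the node's current state.
class RefreshReconcileSketch {
  enum State { RUNNING, DECOMMISSIONING, DECOMMISSIONED }
  enum Action { NONE, DECOMMISSION_GRACEFULLY, UPDATE_TIMEOUT, RECOMMISSION }

  static Action decide(State current, boolean excluded,
                       Integer currentTimeout, Integer wantedTimeout) {
    if (excluded) {
      if (current == State.RUNNING) {
        return Action.DECOMMISSION_GRACEFULLY; // newly excluded node
      }
      if (current == State.DECOMMISSIONING
          && !Objects.equals(currentTimeout, wantedTimeout)) {
        return Action.UPDATE_TIMEOUT;           // same intention, different timeout
      }
      return Action.NONE;
    }
    if (current == State.DECOMMISSIONING) {
      return Action.RECOMMISSION;               // removed from the exclude file in time
    }
    // Simplification: nodes that are already DECOMMISSIONED are left alone in this sketch.
    return Action.NONE;
  }

  public static void main(String[] args) {
    // T1: node becomes excluded; T2: it is removed again while still decommissioning.
    System.out.println(decide(State.RUNNING, true, null, 120));
    System.out.println(decide(State.DECOMMISSIONING, false, 120, null));
  }
}
```

Because the decision depends only on the current state and the latest exclude list, repeated or overlapping refreshes converge on the same result regardless of how many requests were issued in between.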
The client-side tracking code will force decommission the nodes (line 362) upon timeout. The
timeout was purely tracked by the client before (the client must keep running to enforce it). If the
client is somehow terminated (for example Ctrl+C by the user) before the timeout, then the timeout
is not enforced. Further, if the same command is run again, the timeout clock resets. Also, the
user might have issued refreshNodes() without an explicit timeout before, in which case the default
timeout will be used by the RM, so a new refreshNodes(timeout) request with client-side tracking
might not lead to the desired timeout.
The current code (below) tries to balance the client-side tracking and server side logic so
that they are mostly compatible. For the simple scenario where only client-side tracking is
used (I assume this is Junping’s scenario), I expect the user experience to be similar. The client
code below waits for 5 extra seconds before force decommissioning, which is usually not needed as
the node will become decommissioned before or upon the timeout on the RM side. The client will
finish earlier should the node become decommissioned earlier. I have verified this
behavior before, but not with the 06/12 patch.
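
The client-side wait described above can be sketched as a polling loop with a forced fallback. This is not the RMAdminCLI code referred to in the message: `ClientSideTrackingSketch` and `waitThenForce` are hypothetical names, and the clock, the status check, and the force action are injected so that the sketch runs without a cluster.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical client-side tracking loop: wait for completion, force on timeout.
class ClientSideTrackingSketch {
  // Returns true if the nodes finished on their own, false if the client had to force them.
  static boolean waitThenForce(int timeoutSeconds, int graceSeconds,
                               BooleanSupplier allDecommissioned,
                               Runnable forceDecommission,
                               LongSupplier nowSeconds,
                               Runnable pause) {
    // The grace period models the extra seconds the client waits beyond the timeout.
    long deadline = nowSeconds.getAsLong() + timeoutSeconds + graceSeconds;
    while (nowSeconds.getAsLong() < deadline) {
      if (allDecommissioned.getAsBoolean()) {
        return true;      // finished early, nothing to force
      }
      pause.run();        // e.g. sleep one second between polls
    }
    forceDecommission.run(); // deadline reached: fall back to forceful decommission
    return false;
  }

  public static void main(String[] args) {
    long[] fakeClock = {0};
    boolean finished = waitThenForce(30, 5,
        () -> fakeClock[0] >= 10,                    // pretend nodes are done after 10s
        () -> System.out.println("forcing decommission"),
        () -> fakeClock[0],
        () -> fakeClock[0]++);                       // each pause advances the fake clock
    System.out.println("finished on its own: " + finished);
  }
}
```

The loop also shows why the timeout is only enforced while the client keeps running: if the process is killed before the deadline, the forced fallback never executes.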

> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, GracefulDecommissionYarnNode.pdf,
>                      YARN-4676.004.patch, YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch,
>                      YARN-4676.008.patch, YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch,
>                      YARN-4676.012.patch, YARN-4676.013.patch, YARN-4676.014.patch, YARN-4676.015.patch,
>                      YARN-4676.016.patch
> YARN-4676 implements an automatic, asynchronous and flexible mechanism to gracefully decommission
> YARN nodes. After the user issues the refreshNodes request, ResourceManager automatically evaluates
> the status of all affected nodes to kick off decommission or recommission actions. The RM asynchronously
> tracks container and application status related to DECOMMISSIONING nodes to decommission
> the nodes immediately after they are ready to be decommissioned. Decommissioning timeout at individual
> node granularity is supported and can be dynamically updated. The mechanism naturally supports multiple
> independent graceful decommissioning “sessions” where each one involves different sets of nodes with
> different timeout settings. Such support is ideal and necessary for graceful decommission requests issued
> by external cluster management software instead of a human.
> DecommissioningNodesWatcher inside ResourceTrackerService tracks DECOMMISSIONING nodes'
> status automatically and asynchronously after the client/admin makes the graceful decommission
> request. It tracks DECOMMISSIONING nodes' status to decide when a node, after all running containers
> on the node have completed, will be transitioned into DECOMMISSIONED state. NodesListManager
> detects and handles include and exclude list changes to kick off decommission or recommission
> as necessary.

This message was sent by Atlassian JIRA
