hadoop-yarn-issues mailing list archives

From "Robert Kanter (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4676) Automatic and Asynchronous Decommissioning Nodes Status Tracking
Date Thu, 09 Jun 2016 00:25:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321698#comment-15321698 ]

Robert Kanter commented on YARN-4676:

Sorry [~danzhi] for disappearing for a bit there.  I got sidetracked with some other responsibilities.
Thanks [~vvasudev] for your detailed comments too.  Here are some additional comments on the
latest patch (14):

# The patch doesn't apply cleanly to the current trunk
#- I rolled my repo back to an older commit where the patch does apply cleanly, but some tests fail:
 Time elapsed: 12.43 sec  <<< FAILURE!
java.lang.AssertionError: Node state is not correct (timedout) expected:<DECOMMISSIONING>
but was:<SHUTDOWN>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:727)
	at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1474)
	at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1413)

 Time elapsed: 3.184 sec  <<< FAILURE!
java.lang.AssertionError: Node should have been forgotten! expected:<host2:5678> but
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1586)
	at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalGracefully(TestResourceTrackerService.java:1421)
# I like [~vvasudev]'s suggestion in an [earlier comment|https://issues.apache.org/jira/browse/YARN-4676?focusedCommentId=15272554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15272554]
about having the RM tell the NM to do a delayed shutdown.  That keeps the RM from having to
track anything, so we don't have to worry about RM failovers, and I think it would be a lot
simpler to implement and maintain.  I'd suggest we do that in this JIRA instead of a followup
JIRA; otherwise we'll commit a bunch of code here just to throw it out later.
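To make the delayed-shutdown idea concrete, here is a rough NM-side sketch. All names here ({{DelayedShutdown}}, {{onDecommissionRequest}}, {{stopNode}}) are hypothetical, not actual YARN APIs; the point is only that once the RM hands the timeout to the NM in a heartbeat response, the NM can schedule its own shutdown and the RM has no per-node state to survive failover:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical NM-side handler for an RM-initiated graceful decommission.
// Not actual YARN code; a sketch of the "NM shuts itself down" design.
public class DelayedShutdown {
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "delayed-shutdown");
        t.setDaemon(true); // don't keep the JVM alive just for the timer
        return t;
      });
  private volatile boolean scheduled = false;

  // The RM sends the decommission timeout in its heartbeat response;
  // the NM schedules its own shutdown locally.
  public void onDecommissionRequest(long timeoutSeconds, Runnable stopNode) {
    scheduled = true;
    timer.schedule(stopNode, timeoutSeconds, TimeUnit.SECONDS);
  }

  public boolean shutdownScheduled() {
    return scheduled;
  }
}
```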
# In {{HostsFileReader#readXmlFileToMapWithFileInputStream}}, you can replace the multiple
{{catch}} blocks with a single {{catch}} using this syntax:
catch (IOException | SAXException | ParserConfigurationException e) {
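For reference, a minimal self-contained example of that Java 7+ multi-catch form (the class and method here are illustrative, not the actual {{HostsFileReader}} code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class MultiCatchDemo {
  // Parse an XML snippet and return its root tag name, or null on any failure.
  static String rootTag(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return doc.getDocumentElement().getTagName();
    } catch (IOException | SAXException | ParserConfigurationException e) {
      // One handler covers all three checked exceptions (Java 7+ multi-catch),
      // replacing three separate catch blocks with identical bodies.
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(rootTag("<hosts><host>h1</host></hosts>")); // hosts
    System.out.println(rootTag("<broken"));                        // null
  }
}
```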
# I also agree with [~vvasudev] on point 7 about the exit-wait.ms property.  This seems like
a separate feature, so if you still want it, I'd suggest creating a separate JIRA with just that change.

> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, GracefulDecommissionYarnNode.pdf,
YARN-4676.004.patch, YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch,
YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch, YARN-4676.012.patch, YARN-4676.013.patch,
> YARN-4676 implements an automatic, asynchronous and flexible mechanism to gracefully decommission
> YARN nodes. After the user issues the refreshNodes request, the ResourceManager automatically
> evaluates the status of all affected nodes to kick off decommission or recommission actions.
> The RM asynchronously tracks container and application status related to DECOMMISSIONING nodes
> in order to decommission nodes as soon as they are ready. A decommissioning timeout at
> individual-node granularity is supported and can be dynamically updated. The mechanism naturally
> supports multiple independent graceful decommissioning "sessions", each involving a different
> set of nodes with different timeout settings. Such support is ideal and necessary for graceful
> decommission requests issued by external cluster management software rather than by a human.
> DecommissioningNodeWatcher inside ResourceTrackerService tracks DECOMMISSIONING node status
> automatically and asynchronously after the client/admin makes the graceful decommission
> request. It tracks DECOMMISSIONING node status to decide when a node, after all running
> containers on it have completed, will be transitioned into the DECOMMISSIONED state.
> NodesListManager detects and handles include- and exclude-list changes to kick off
> decommission or recommission as necessary.
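The core decision the description attributes to DecommissioningNodeWatcher could be sketched roughly as follows. This is an illustration only, not the actual DecommissioningNodeWatcher implementation; class and method names ({{DecommissionTracker}}, {{startDecommission}}, {{readyToDecommission}}) are made up:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a node is ready to move from DECOMMISSIONING to
// DECOMMISSIONED once it has no live containers, or once its per-node
// timeout has expired, whichever comes first.
public class DecommissionTracker {
  private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

  // Record a per-node deadline; timeouts can differ per node and per "session".
  public void startDecommission(String nodeId, long nowMillis, long timeoutMillis) {
    deadlines.put(nodeId, nowMillis + timeoutMillis);
  }

  // Called on each node heartbeat with the node's current live-container count.
  public boolean readyToDecommission(String nodeId, int liveContainers, long nowMillis) {
    Long deadline = deadlines.get(nodeId);
    if (deadline == null) {
      return false; // node is not decommissioning
    }
    return liveContainers == 0 || nowMillis >= deadline;
  }
}
```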

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
