hadoop-yarn-issues mailing list archives

From "Carlo Curino (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-6451) Create a monitor to check whether we maintain RM (scheduling) invariants
Date Thu, 06 Apr 2017 23:04:41 GMT

    [ https://issues.apache.org/jira/browse/YARN-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959937#comment-15959937 ]

Carlo Curino edited comment on YARN-6451 at 4/6/17 11:04 PM:
-------------------------------------------------------------

The patch provides an initial implementation of this idea. It does the following simple
thing:

Every time the InvariantsChecker is invoked, it:
 # polls the QueueMetrics (this could/should be extended and made configurable)
 # checks a list of invariants (loaded from a config file)
 # logs any violation as a warning
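
The three steps above can be sketched roughly as follows. This is only an illustration and not the code in the attached patch: the class name InvariantsCheckerSketch, the use of a plain Map as the metrics snapshot, and the Predicate-based invariants are all assumptions made for the example, and loading invariants from a config file is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: evaluates named invariants against a metrics snapshot. */
public class InvariantsCheckerSketch {

  // Invariant name -> predicate over the polled metrics (assumed representation).
  private final Map<String, Predicate<Map<String, Long>>> invariants =
      new LinkedHashMap<>();

  /** Registers an invariant; a real checker would load these from a config file. */
  public void addInvariant(String name, Predicate<Map<String, Long>> check) {
    invariants.put(name, check);
  }

  /**
   * Checks every registered invariant against one snapshot of metrics,
   * logs each violation as a warning, and returns the violated names.
   */
  public List<String> check(Map<String, Long> metrics) {
    List<String> violated = new ArrayList<>();
    for (Map.Entry<String, Predicate<Map<String, Long>>> e : invariants.entrySet()) {
      if (!e.getValue().test(metrics)) {
        System.err.println("WARN invariant violated: " + e.getKey());
        violated.add(e.getKey());
      }
    }
    return violated;
  }
}
```

A caller would poll the metrics into a map (the metric key is up to the caller, e.g. "AvailableMB"), register an invariant such as m -> m.getOrDefault("AvailableMB", 0L) >= 0, and invoke check() on each polling interval.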

The idea is to use this in a few ways:
 # For SLS-based unit/integration tests that ensure correctness of the overall RM subsystem,
 e.g., running for a while and checking that important invariants are never violated (e.g.,
resources must stay non-negative, and locality should not go from usually good to very bad
after a check-in).
 # For performance analysis via SLS (and fixed environments), e.g., detecting allocation
latency starting to get worse after a certain change.
 # In production environments, to "anticipate" customer griping.

An extension of this is to make the "action" triggered when an invariant is violated configurable,
e.g., in some cases a log message is all that is needed, while at other times one may want an
alert, or even a System.exit() if things are really bad (and/or the deployment allows it).
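
One way such a configurable action could look is sketched below. Everything here is hypothetical and not part of the patch: the Action enum, the parse() helper for a configured string, and the injected exit hook (which a real deployment could bind to System::exit) are assumptions chosen to keep the example small and testable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.IntConsumer;

/** Hypothetical sketch: dispatches a configurable action on invariant violation. */
public class ViolationHandlerSketch {

  /** Assumed set of actions: log only, raise an alert, or terminate. */
  public enum Action { LOG, ALERT, EXIT }

  private final Action action;
  private final IntConsumer exitHook;                    // e.g. System::exit in production
  private final List<String> alerts = new ArrayList<>(); // stand-in for an alerting system

  public ViolationHandlerSketch(Action action, IntConsumer exitHook) {
    this.action = action;
    this.exitHook = exitHook;
  }

  /** Maps a configured string to an action; defaults to LOG when unset. */
  public static Action parse(String configured) {
    if (configured == null) {
      return Action.LOG;
    }
    return Action.valueOf(configured.trim().toUpperCase(Locale.ROOT));
  }

  /** Applies the configured action for one violated invariant. */
  public void onViolation(String invariant) {
    switch (action) {
      case ALERT:
        alerts.add(invariant);
        break;
      case EXIT:
        System.err.println("FATAL invariant violated: " + invariant);
        exitHook.accept(1);
        break;
      case LOG:
      default:
        System.err.println("WARN invariant violated: " + invariant);
        break;
    }
  }

  public List<String> getAlerts() {
    return alerts;
  }
}
```

Injecting the exit hook instead of calling System.exit() directly keeps the handler usable in tests and lets a deployment that does not allow termination substitute a no-op.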

[~wangda], [~jlowe], [~kasha], [~subru], [~kkaranasos], [~asuresh], [~chris.douglas], [~hrsharma],
[~roniburd], [~kishorch]: Thoughts?



> Create a monitor to check whether we maintain RM (scheduling) invariants
> ------------------------------------------------------------------------
>
>                 Key: YARN-6451
>                 URL: https://issues.apache.org/jira/browse/YARN-6451
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>         Attachments: YARN-6451.v0.patch
>
>
> For SLS runs, as well as for live test clusters (and maybe prod), it would be useful
> to have a mechanism to continuously check whether core invariants of the RM/Scheduler are
> respected (e.g., no priority inversions, fairness mostly respected, certain latencies within
> the expected range, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

