hadoop-yarn-issues mailing list archives

From "Carlo Curino (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6451) Create a monitor to check whether we maintain RM (scheduling) invariants
Date Mon, 17 Apr 2017 19:30:41 GMT

    [ https://issues.apache.org/jira/browse/YARN-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971530#comment-15971530 ]

Carlo Curino commented on YARN-6451:

Thanks [~chris.douglas] for the feedback.

# I did the precompilation as you suggested (I didn't know the JavaScript {{ScriptEngine}} also
implements {{Compilable}}), and it helps somewhat. Poking at performance, I also found that the
longer I ran it the slower it got; this was due to the collector accumulating records, so I now
clear it at each iteration. Combined, these changes bring us down to about 1ms per iteration
if we keep all invariants separate (one per line of our script file), and *0.07ms per invocation*
if we combine them into a single large invariant (all individual invariants joined with &&).
The trade-off: when an invariant is violated, the log line is harder to read in the combined
form, but performance is much better. In the current example of {{invariants.txt}} I will keep
one invariant per line, so slower but easier to understand. Does that work?

# I added this to the logging/exception message. In particular, I am pruning the bindings
so that the message contains only the bindings used in the failing invariant (barring the
performance tricks above, this makes for very readable output).

# As we discussed offline, it is true we could push the checking deep into the collector
and detect issues a little closer to when they happen. But since we run this, say, every second,
it is unlikely to improve detection much (we would shave sub-millisecond time, but we
might still be 0.5sec off on average from when the violation occurred). Short of checking
at every metrics update (very costly), we can probably only detect issues a little after they
have happened. This still seems much better than days later when a customer complains :-)
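
A rough back-of-the-envelope version of the estimate in item 3, under assumptions of my own: the monitor runs every $T$ seconds, a violation is equally likely at any instant within that interval, and one check costs $c$. The symbols $T$, $c$, and $D$ (detection delay) are introduced here for illustration only.

```latex
\[
  \mathbb{E}[D] \;=\; \frac{1}{T}\int_{0}^{T} (T - t)\,dt \;+\; c \;=\; \frac{T}{2} + c
\]
```

With $T = 1\,\mathrm{s}$ this gives roughly 0.5 s on average. Under this model, shaving the sub-millisecond part $c$ barely moves the total; only a shorter period $T$ would.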
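
To make items 1 and 2 concrete, here is a minimal sketch of precompiling one invariant per line through {{javax.script}} and reporting only the bindings a failing invariant uses. It is not the code from the patch: the class {{InvariantSketch}}, its methods, and the metric names ({{running_0}}, {{allocated_mb}}, {{total_mb}}) are hypothetical. It also assumes an engine registered under the name "JavaScript" is available; JDK 15 and later no longer bundle one, in which case the sketch compiles nothing and reports no violations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

/** Hypothetical helper for illustration; not the class from the YARN-6451 patch. */
class InvariantSketch {
  private static final Pattern IDENT = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  /** Joins per-line invariants into one expression: (i1) && (i2) && ... */
  static String combine(List<String> invariants) {
    return invariants.stream().map(s -> "(" + s + ")").collect(Collectors.joining(" && "));
  }

  /** Compiles each invariant once. Returns an empty map when no Compilable
   *  "JavaScript" engine is installed (assumption: engine name "JavaScript"). */
  static Map<String, CompiledScript> precompile(List<String> invariants) {
    ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
    Map<String, CompiledScript> compiled = new LinkedHashMap<>();
    if (!(engine instanceof Compilable)) {
      return compiled;
    }
    for (String inv : invariants) {
      try {
        compiled.put(inv, ((Compilable) engine).compile(inv));
      } catch (ScriptException e) {
        throw new IllegalArgumentException("Cannot compile invariant: " + inv, e);
      }
    }
    return compiled;
  }

  /** Keeps only the bindings whose names occur as identifiers in the invariant. */
  static Map<String, Object> prune(String invariant, Map<String, Object> bindings) {
    Map<String, Object> used = new LinkedHashMap<>();
    Matcher m = IDENT.matcher(invariant);
    while (m.find()) {
      if (bindings.containsKey(m.group())) {
        used.put(m.group(), bindings.get(m.group()));
      }
    }
    return used;
  }

  /** Evaluates every precompiled invariant; returns one message per violation. */
  static List<String> check(Map<String, CompiledScript> compiled, Map<String, Object> metrics) {
    List<String> violations = new ArrayList<>();
    for (Map.Entry<String, CompiledScript> e : compiled.entrySet()) {
      try {
        // Evaluate against a copy so the engine cannot add entries to the caller's map.
        Object result = e.getValue().eval(new SimpleBindings(new LinkedHashMap<>(metrics)));
        if (!Boolean.TRUE.equals(result)) {
          violations.add("Violated: " + e.getKey() + " with " + prune(e.getKey(), metrics));
        }
      } catch (ScriptException ex) {
        throw new IllegalStateException("Cannot evaluate invariant: " + e.getKey(), ex);
      }
    }
    return violations;
  }

  public static void main(String[] args) {
    List<String> lines = List.of("running_0 >= 0", "allocated_mb <= total_mb");
    Map<String, Object> metrics = new LinkedHashMap<>();
    metrics.put("running_0", -1);
    metrics.put("allocated_mb", 1024);
    metrics.put("total_mb", 4096);
    System.out.println("Combined form: " + combine(lines));
    check(precompile(lines), metrics).forEach(System.out::println);
  }
}
```

Keeping the source text as the map key is what lets a violation message name the exact line that failed; the combined form from {{combine}} would need only one {{eval}} per iteration but could only report that something in the conjunction was false.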

> Create a monitor to check whether we maintain RM (scheduling) invariants
> ------------------------------------------------------------------------
>                 Key: YARN-6451
>                 URL: https://issues.apache.org/jira/browse/YARN-6451
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>         Attachments: YARN-6451.v0.patch, YARN-6451.v1.patch, YARN-6451.v2.patch
> For SLS runs, as well as for live test clusters (and maybe prod), it would be useful
to have a mechanism to continuously check whether core invariants of the RM/Scheduler are
respected (e.g., no priority inversions, fairness mostly respected, certain latencies within
expected range, etc.)

