hadoop-yarn-issues mailing list archives

From "Sumit Nigam (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state.
Date Tue, 21 Jul 2015 14:47:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635202#comment-14635202 ]

Sumit Nigam commented on YARN-3946:

Hi [~varun_saxena] - 
Yes, the idea is not only to debug the issue (which, as you rightly mentioned, an admin can do).
I am currently on 2.6.0 and will try 2.7.0 when I can, for sure.

There are too many possible reasons to correlate by hand what may have happened - AM level,
resource level, queue level, possibly a combination of these, etc. A programmatic API is also
useful for applying corrective measures - say, once I notice it is a queue-level capacity issue,
I can program my client to submit the app to a whole new queue altogether, or to try reserving
a container, etc. - all programmatically!
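
To illustrate the corrective-measure idea, here is a minimal sketch of how a client could react if the RM reported a reason. The reason codes, the function name and the action names are all hypothetical; YARN exposes nothing like them today, which is the point of this issue.

```python
from enum import Enum


class PendingReason(Enum):
    """Hypothetical reason codes; the RM does not expose these today."""
    QUEUE_LIMIT = "queue_limit"
    AM_LIMIT = "am_limit"
    INSUFFICIENT_RESOURCES = "insufficient_resources"


def choose_action(reasons, fallback_queue=None):
    """Pick one corrective step for an app that is stuck in ACCEPTED.

    Returns an (action, argument) tuple; carrying it out is left to the caller.
    """
    reasons = set(reasons)
    queue_bound = {PendingReason.QUEUE_LIMIT, PendingReason.AM_LIMIT}
    # Queue or AM limits are tied to the queue, so another queue may help.
    if reasons & queue_bound and fallback_queue is not None:
        return ("resubmit_to_queue", fallback_queue)
    # Core/memory shortage: try reserving a container instead.
    if PendingReason.INSUFFICIENT_RESOURCES in reasons:
        return ("reserve_container", None)
    return ("keep_waiting", None)
```

For example, `choose_action([PendingReason.QUEUE_LIMIT], "adhoc")` would suggest resubmitting to a (made-up) queue named `adhoc`.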

Another important use case is attempting to submit the app (say, through one's own AM) and,
after it has remained in ACCEPTED state for a while, automatically reporting back why the
state remains so. A REST API is extremely useful in such a case. With this, it would even be
possible to ascertain when a job moves back to ACCEPTED state from RUNNING state (RM restart,
AM crash + restart). Again, this currently requires looking through logs / UI to ascertain
what happened. Especially in big clusters, this is indeed non-trivial.
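
As a rough sketch of what a client can already do, the code below polls the existing RM REST endpoint `/ws/v1/cluster/apps/{appid}` and flags an app that lingers in ACCEPTED or drops back to it from RUNNING. It can only report that this happened, not why. The function names, event names, thresholds and RM address are assumptions made for illustration.

```python
import json
import time
import urllib.request

# Application states after which polling is pointless.
TERMINAL = {"FINISHED", "FAILED", "KILLED"}


def fetch_app_state(rm_url, app_id):
    """Return the state the RM reports for one app, e.g. "ACCEPTED"."""
    url = f"{rm_url}/ws/v1/cluster/apps/{app_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["app"]["state"]


def watch_accepted(get_state, max_accepted_secs, poll_secs=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until the app is stuck in ACCEPTED or terminates.

    Returns a list of event strings (names are illustrative only).
    """
    events = []
    previous = None
    accepted_since = None
    while True:
        state = get_state()
        now = clock()
        if state == "ACCEPTED":
            if previous == "RUNNING":
                # RUNNING -> ACCEPTED, e.g. RM restart or AM crash + restart.
                events.append("regressed")
            if previous != "ACCEPTED":
                accepted_since = now
            elif now - accepted_since >= max_accepted_secs:
                events.append("stuck")
                return events
        elif state in TERMINAL:
            events.append("terminal:" + state)
            return events
        previous = state
        sleep(poll_secs)
```

A caller might run `watch_accepted(lambda: fetch_app_state("http://rm-host:8088", app_id), 300)`, where `rm-host` and the five-minute limit are placeholders.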

I'd agree with Nagannarasimha that we should be able to know that without administrative understanding
of the same. Plus, I am not working on this.

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state.
> ---------------------------------------------------------------------------
>                 Key: YARN-3946
>                 URL: https://issues.apache.org/jira/browse/YARN-3946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sumit Nigam
> Currently there is no direct way to get the exact reason as to why a submitted app is
> still in ACCEPTED state. It should be possible to know through RM REST API as to what aspect
> is not being met - say, queue limits being reached, or core/memory requirement not being
> met, or AM limit being reached, etc.

This message was sent by Atlassian JIRA