hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity
Date Sat, 05 Sep 2015 08:20:45 GMT

    [ https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731875#comment-14731875 ]

Sunil G commented on YARN-4091:

Thank you [~leftnoteasy] for the detailed information. Based on your input, and after syncing
offline with [~rohithsharma] and [~nijel], I am trying to summarize a viewpoint for this.
The REST response in the example shows only very raw information for now; we'll add detailed
information later.

*Adding more diagnostics and debug information to the scheduler will help the user get two
levels of knowledge. If we fetch this information with two REST API calls, the specific reason
for a potential problem in the scheduler can be identified and action can be taken.*

*1*. What happened to an application recently in the scheduler (like status from node heartbeats)

    - The application might not have got the containers it asked for
          Reason: User limit for the application has been reached
    - The application might still be in the pending state, yet to become active
          Reason: AM resource limit is exhausted, hence the app can't be made active

*Benefit for user with this info*:  
   User will get to know the clear problem area to look for along with potential reason for
*How User can get this info*:
  Via the REST API, debug/diagnostic information can be fetched for a queue/application.
*Expected O/P*:
 queue - a:
      application : app1
              appState : RUNNING
              reasonPhrase : NA
              lastContainerAssignmentState : SKIPPED_ASSIGNMENT
              reasonPhrase : Userlimit quota is reached
      application : app2
              appState : ACCEPTED
              reasonPhrase : AM resource limit exhausted
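
On the client side, the expected output above could be modeled roughly as follows. This is only a sketch under assumptions: the `AppDiagnostics` class, its field names and the `blocked_apps` helper are hypothetical and are not part of any existing YARN API. The sample values are copied from the expected output, with the "NA" reason treated as "no reason".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AppDiagnostics:
    """Hypothetical client-side view of one application entry in the proposed response."""
    application: str
    app_state: str
    app_state_reason: Optional[str] = None               # reasonPhrase for appState; "NA" maps to None
    last_container_assignment_state: Optional[str] = None
    assignment_reason: Optional[str] = None              # reasonPhrase for the last assignment


def blocked_apps(queues: Dict[str, List[AppDiagnostics]]) -> List[Tuple[str, str, str]]:
    """Return (queue, application, reason) for every application that reports a reason."""
    found = []
    for queue_name, apps in queues.items():
        for app in apps:
            # Prefer the container-assignment reason, fall back to the app-state reason.
            reason = app.assignment_reason or app.app_state_reason
            if reason:
                found.append((queue_name, app.application, reason))
    return found


# Sample data taken from the expected output for queue "a".
SAMPLE = {
    "a": [
        AppDiagnostics("app1", "RUNNING", None,
                       "SKIPPED_ASSIGNMENT", "Userlimit quota is reached"),
        AppDiagnostics("app2", "ACCEPTED", "AM resource limit exhausted"),
    ]
}

for queue, app, reason in blocked_apps(SAMPLE):
    print(f"queue {queue}: {app}: {reason}")
```

A flat list of (queue, application, reason) tuples is used here only because it is easy to filter; the real response format is still open for discussion.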

*2*. Data/metrics information from the scheduler that is specific to the problem identified
in 1.

    - User can fetch metrics information via REST such as the current queue capacity, the user limit
configured, the user limit calculated within the scheduler, etc.
    - User can fetch metrics information via REST such as queue capacity, AM resource % configured,
AM resource % calculated within the RM, current demand, etc.
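
To illustrate how the configured and calculated values from this second call might be compared against current usage, here is a minimal sketch. Everything in it is an assumption made for illustration: the `LimitMetric` class, the `exhausted_limits` helper, and the numbers, which are invented and do not come from a real cluster or from the RM's actual calculation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LimitMetric:
    """Hypothetical record pairing a configured limit with the value the scheduler calculated."""
    name: str
    configured: float   # value set in the configuration
    calculated: float   # effective value computed inside the scheduler/RM
    current: float      # current usage or demand, in the same unit as `calculated`

    def exhausted(self) -> bool:
        # The limit is considered exhausted once usage reaches the calculated value.
        return self.current >= self.calculated


def exhausted_limits(metrics: List[LimitMetric]) -> List[str]:
    """Return the names of all limits that are currently exhausted."""
    return [m.name for m in metrics if m.exhausted()]


# Invented illustrative values (percentages).
SAMPLE_METRICS = [
    LimitMetric("user limit", configured=50.0, calculated=40.0, current=40.0),
    LimitMetric("AM resource %", configured=10.0, calculated=10.0, current=6.0),
]

print(exhausted_limits(SAMPLE_METRICS))
```

With these made-up numbers only the user limit is reported, which would match a level 1 reason such as "Userlimit quota is reached".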

This two-level information will help the user take the correct measures in the cluster to fix the problem,
such as increasing the priority of an app, changing the queue of an application, killing some containers
on a node manually, or some auto-tuning from the AM.

> Improvement: Introduce more debug/diagnostics information to detail out scheduler activity
> ------------------------------------------------------------------------------------------
>                 Key: YARN-4091
>                 URL: https://issues.apache.org/jira/browse/YARN-4091
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>         Attachments: Improvement on debugdiagnostic information - YARN.pdf
> As schedulers are improved with various new capabilities, more configurations that tune
the schedulers start to take actions such as limiting the assignment of containers to an application,
or introducing a delay in allocating containers, etc.
> No clear information is passed down from the scheduler to the outside world under these various
scenarios. This makes debugging much harder.
> This ticket is an effort to introduce more defined states at the various points in the scheduler
where it skips/rejects container assignment, activates an application, etc. Such information will
help the user know what's happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve on this as
we discuss.

This message was sent by Atlassian JIRA
