Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Date: Wed, 13 Jul 2016 14:43:20 +0000 (UTC)
From: "Sunil G (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12859676.1440698597000.16816.1468421000747@Atlassian.JIRA>
In-Reply-To: <JIRA.12859676.1440698597000@Atlassian.JIRA>
References: <JIRA.12859676.1440698597000@Atlassian.JIRA> <JIRA.12859676.1440698597314@arcas>
Subject: [jira] [Commented] (YARN-4091) Improvement: Introduce more
 debug/diagnostics information to detail out scheduler activity
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Wed, 13 Jul 2016 14:43:22 -0000


    [ https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375131#comment-15375131 ] 

Sunil G commented on YARN-4091:
-------------------------------

Thanks [~ChenGe] for the patch and detailed doc.

A few initial comments below; I will share more feedback soon.

*REST API comments:*
1. For a REST query ending with {{activities?nodeId=node-87}}, I think it may scan all NMs on that host if multiple NMs are running on the same node. Is that correct?
2. If we support the above option, could we pass node names in comma-separated form to {{nodeId}}, like {{activities?nodeId=node-87,node-88}}? Maybe we can define a limit on the number of NodeManagers that can be queried, since the response output also needs to be simple to understand.
3. For {{app-activities?appId=application_1468198570845_0022}}, I think the output is different from the node query? Could you also please attach sample REST output for the app and node scenarios?
4. Sometimes we may look for relaxed scheduling by considering missed opportunities. In a few cases (rack-local allocation, the default partition from a shared label, etc.), one round of node heartbeats has to pass before an allocation is made. It would be better to add an option such as collecting scheduler activity for an app until its missed-opportunity count is 0. Thoughts?
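
To make comment 2 concrete, here is a minimal sketch of how a comma-separated {{nodeId}} value could be parsed with an upper bound on the number of nodes. The class name {{NodeIdQuery}}, the method {{parse}} and the limit of 5 are assumptions for illustration only; none of them come from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the YARN-4091 patch.
final class NodeIdQuery {
  // Assumed cap on how many nodes a single request may name.
  static final int MAX_NODES = 5;

  // Splits a value such as "node-87,node-88" into distinct, trimmed node
  // names, preserving the order in which they were given.
  static List<String> parse(String nodeIdParam) {
    if (nodeIdParam == null) {
      throw new IllegalArgumentException("nodeId is required");
    }
    Set<String> names = new LinkedHashSet<>();
    for (String part : nodeIdParam.split(",")) {
      String name = part.trim();
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    if (names.isEmpty()) {
      throw new IllegalArgumentException("nodeId names no node");
    }
    if (names.size() > MAX_NODES) {
      throw new IllegalArgumentException(
          "at most " + MAX_NODES + " nodes can be queried at once");
    }
    return new ArrayList<>(names);
  }
}
```

Rejecting oversized requests up front would keep the response small; whether the limit should be fixed or configurable is open.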


*General comments:*
1. ActivityManager is the class that holds all the information tracked about scheduling activities. Over time, I think we might need to consider cases like cleanup of outstanding requests, and internal aggregation to compact and re-order the data collected across heartbeats. For all these cases, I think it would be better to make ActivityManager an extended service of the scheduler, so that it can start a thread associated with the service to do all the monitoring and cleanup. This is just a thought and a good-to-have option, so please feel free to share your opinion.
2. I am in favor of keeping the current simple, direct calls to start/update/stop a scheduling activity. But would it be better to define a read-write interface that clearly states who reads the data and who can write to the activity manager? On second thought, could we raise events to ActivityManager from the scheduler and make writes asynchronous? That may be clearer and simpler. Thoughts?
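
A rough sketch of what both general comments could look like together, using only the JDK: separate writer and reader interfaces, with writes handed to a worker thread through a queue. All names here ({{ActivityWriter}}, {{ActivityReader}}, {{AsyncActivityStore}}, {{forget}}) are hypothetical and do not come from the patch; a real version would presumably extend the scheduler's service framework rather than manage a raw thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// All names below are hypothetical; none come from the YARN-4091 patch.
interface ActivityWriter {   // the side the scheduler would call
  void record(String appId, String diagnostic);
}

interface ActivityReader {   // the side a REST endpoint would call
  List<String> activitiesOf(String appId);
}

final class AsyncActivityStore implements ActivityWriter, ActivityReader {
  private static final class Event {
    final String appId;
    final String diagnostic;
    Event(String appId, String diagnostic) {
      this.appId = appId;
      this.diagnostic = diagnostic;
    }
  }

  private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
  private final Map<String, List<String>> byApp = new ConcurrentHashMap<>();
  private final Thread worker = new Thread(this::drain, "activity-store");
  private volatile boolean running;

  void start() {
    running = true;
    worker.setDaemon(true);
    worker.start();
  }

  // Stops the worker after it has applied every event already queued.
  void stop() {
    running = false;
    worker.interrupt();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  @Override
  public void record(String appId, String diagnostic) {
    // The queue is unbounded, so the calling (scheduler) thread never blocks.
    queue.offer(new Event(appId, diagnostic));
  }

  @Override
  public List<String> activitiesOf(String appId) {
    List<String> found = byApp.get(appId);
    return found == null ? new ArrayList<>() : new ArrayList<>(found);
  }

  // Cleanup hook; a periodic task could call this for finished apps.
  void forget(String appId) {
    byApp.remove(appId);
  }

  private void drain() {
    while (running || !queue.isEmpty()) {
      try {
        Event e = queue.take();
        byApp.computeIfAbsent(e.appId, k -> new CopyOnWriteArrayList<>())
            .add(e.diagnostic);
      } catch (InterruptedException ie) {
        // stop() was requested; the loop condition decides whether to exit.
      }
    }
  }
}
```

With this split the scheduler only sees {{ActivityWriter}} and the web layer only sees {{ActivityReader}}. The trade-off is that a read issued right after a write may not see it yet, which seems acceptable for diagnostics.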


> Improvement: Introduce more debug/diagnostics information to detail out scheduler activity
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-4091
>                 URL: https://issues.apache.org/jira/browse/YARN-4091
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Chen Ge
>         Attachments: Improvement on debugdiagnostic information - YARN.pdf, YARN-4091-design-doc-v1.pdf, YARN-4091.preliminary.1.patch
>
>
> As schedulers are improved with various new capabilities, more of the configurations that tune them start to take actions such as limiting container assignment to an application or introducing a delay before allocating a container.
> No clear information is passed from the scheduler to the outside world in these scenarios, which makes debugging much harder.
> This ticket is an effort to introduce more well-defined states at the various points in the scheduler where it skips or rejects a container assignment, activates an application, etc. Such information will help users know what is happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve on this as we discuss.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org