Date: Wed, 13 Jul 2016 14:43:20 +0000 (UTC)
From: "Sunil G (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity
Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
archived-at: Wed, 13 Jul 2016 14:43:22 -0000

    [ https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375131#comment-15375131 ]

Sunil G commented on YARN-4091:
-------------------------------

Thanks [~ChenGe] for the patch and detailed doc.
A few initial comments; I will share more feedback soon.

*REST API comments:*
1. For a REST query ending with {{activities?nodeId=node-87}}, I think it may scan all NMs on that host if multiple NMs are running on the same node. Correct?
2. If we support the above option, could we pass node names in comma-separated form to {{nodeId}}, like {{activities?nodeId=node-87,node-88}}? Maybe we can define a limit on the number of NodeManagers that can be queried, since the response output also needs to stay simple to understand.
3. For {{app-activities?appId=application_1468198570845_0022}}, I think the output is different from the node case? Could you also please attach the REST output for the app and node scenarios?
4. Sometimes we may look for relaxed scheduling by considering missed opportunities. In a few cases (rack-local, default partition from a shared label, etc.), one round of node heartbeats has to pass before an allocation is made. It would be better to add an option such as "collect scheduler activity for an app until missed opportunities reach 0". Thoughts?

*General comments:*
1. ActivityManager is a class which holds all the information used to track scheduling activities. Over time, I think we might need to consider cases like cleanup of some outstanding requests, and internal aggregation to compact and reorder collected data across heartbeats. For all these cases, I think it is better to make ActivityManager a service attached to the scheduler, so that it can start a thread associated with the service to do all the monitoring and cleanup. This is just a thought; please feel free to share your opinion, as it is a good-to-have option.
2. I am in favor of keeping the current simple, direct calls to start/update/stop a scheduling activity. But would it be better to define a read-write interface that clearly states who reads the data and who can write to the activity manager? On second thought, could we raise events to ActivityManager from the scheduler and make the writes asynchronous?
It may become clearer and simpler. Thoughts?

> Improvement: Introduce more debug/diagnostics information to detail out scheduler activity
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-4091
>                 URL: https://issues.apache.org/jira/browse/YARN-4091
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Chen Ge
>         Attachments: Improvement on debugdiagnostic information - YARN.pdf, YARN-4091-design-doc-v1.pdf, YARN-4091.preliminary.1.patch
>
>
> As schedulers are improved with various new capabilities, more configurations that tune the schedulers start to take actions such as limiting container assignment to an application or introducing a delay in allocating containers.
> There is no clear information passed down from the scheduler to the outside world under these various scenarios. This makes debugging much tougher.
> This ticket is an effort to introduce more defined states at the various points in the scheduler where it skips/rejects container assignment, activates an application, etc. Such information will help users know what is happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve on this as we discuss.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org