Date: Mon, 1 Aug 2016 17:15:20 +0000 (UTC)
From: "Chen Ge (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Comment Edited] (YARN-4091) Add REST API to retrieve scheduler activity

[ https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402442#comment-15402442 ]

Chen Ge edited comment on YARN-4091 at 8/1/16 5:14 PM:
-------------------------------------------------------

Thanks [~sunilg] for tests and comments.
I have modified the patch based on your suggestions. Points 1, 2, 3 and 6 have been addressed.

For 4, *finalAllocationState* is the final state of one allocation attempt. An allocation can fail to assign any container because of a queue-level issue, in which case it never reaches the application level. Renaming the field to *finalAppAllocationState* would therefore misdescribe the cases that involve only the queue.

For 5, we think *allocationState* is meaningful. If the state is accepted, the allocation process has successfully moved on to the next level. If the allocation fails at the queue level, we need a state to indicate that, because it does not always reach the application level.

For 7 and 9, adding this information would be helpful, but it would require substantial changes to the current implementation and may need further code optimization. I am afraid I cannot complete it due to limited time. I believe there will be more thoughts and improvements in the future.

For 8, the information is missing because the second app is not added to the application allocation list during the node heartbeat. Until the AM container has been successfully allocated, there is no activity for that app in the node heartbeat, so there is nothing to record for it.

Thanks again for the detailed tests!

> Add REST API to retrieve scheduler activity
> -------------------------------------------
>
>                 Key: YARN-4091
>                 URL: https://issues.apache.org/jira/browse/YARN-4091
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Chen Ge
>         Attachments: Improvement on debugdiagnostic information - YARN.pdf, SchedulerActivityManager-TestReport v2.pdf, SchedulerActivityManager-TestReport.pdf, YARN-4091-design-doc-v1.pdf, YARN-4091.1.patch, YARN-4091.2.patch, YARN-4091.3.patch, YARN-4091.4.patch, YARN-4091.5.patch, YARN-4091.5.patch, YARN-4091.preliminary.1.patch, app_activities.json, node_activities.json
>
>
> As schedulers are improved with various new capabilities, more configurations that tune the schedulers start to take actions such as limiting container assignment to an application or introducing a delay before allocating a container.
> However, no clear information is passed from the scheduler to the outside world in these scenarios, which makes debugging much harder.
> This ticket is an effort to introduce more clearly defined states in the parts of the scheduler where it skips or rejects container assignment, activates an application, etc. Such information will help users understand what is happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve on this as we discuss.
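The naming argument in points 4 and 5 of the comment describes a hierarchical record: each level an allocation attempt passes through (queue, then application) gets its own *allocationState*, while *finalAllocationState* summarizes the whole attempt even when it stops at the queue. The following is a minimal Python sketch of that idea under stated assumptions, not the patch's implementation: the class and function names, the two-level hierarchy and the ACCEPTED/REJECTED/ALLOCATED values are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AllocationState(Enum):
    """Hypothetical per-level states; not the values used by the patch."""
    ACCEPTED = "ACCEPTED"    # this level passed; the attempt moves to the next level
    REJECTED = "REJECTED"    # this level stopped the attempt
    ALLOCATED = "ALLOCATED"  # a container was assigned at this level


@dataclass
class LevelActivity:
    level: str                          # "queue" or "application" in this sketch
    name: str                           # queue name or application id
    allocation_state: AllocationState   # plays the role of *allocationState*
    diagnostic: str = ""


@dataclass
class AllocationActivity:
    node_id: str
    levels: List[LevelActivity] = field(default_factory=list)
    # Plays the role of *finalAllocationState*: set for every attempt,
    # whether or not it ever reached the application level.
    final_allocation_state: Optional[AllocationState] = None


def try_allocate(node_id: str, queue: str, queue_has_headroom: bool,
                 app_id: str, app_fits: bool) -> AllocationActivity:
    """Hypothetical two-level allocation attempt that records activities."""
    activity = AllocationActivity(node_id)

    # Queue level: a rejection here ends the attempt before any app is visited.
    if not queue_has_headroom:
        activity.levels.append(LevelActivity(
            "queue", queue, AllocationState.REJECTED, "queue limit reached"))
        activity.final_allocation_state = AllocationState.REJECTED
        return activity
    activity.levels.append(LevelActivity("queue", queue, AllocationState.ACCEPTED))

    # Application level: only reached when the queue level accepted.
    if app_fits:
        state, diag = AllocationState.ALLOCATED, ""
    else:
        state, diag = AllocationState.REJECTED, "no pending request fits the node"
    activity.levels.append(LevelActivity("application", app_id, state, diag))
    activity.final_allocation_state = state
    return activity


if __name__ == "__main__":
    blocked = try_allocate("node1", "root.a", False, "app_1", True)
    print(blocked.final_allocation_state.name, [a.level for a in blocked.levels])
    # prints: REJECTED ['queue']
    ok = try_allocate("node1", "root.a", True, "app_1", True)
    print(ok.final_allocation_state.name, [a.allocation_state.name for a in ok.levels])
    # prints: ALLOCATED ['ACCEPTED', 'ALLOCATED']
```

In the queue-rejected case only a queue-level entry exists, yet the attempt still carries a final state. That is the situation in which a name tied to the application level, such as *finalAppAllocationState*, would not fit.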