Date: Thu, 24 Dec 2015 06:23:49 +0000 (UTC)
From: "Naganarasimha G R (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070635#comment-15070635 ]

Naganarasimha G R commented on YARN-3995:
-----------------------------------------

Thanks for the comments [~sjlee0]. IIUC, the 2nd point is a continuation of the first idea, right?

bq. I am not too knowledgeable about the NM and so not sure if this is complicated/infeasible.
{{PerNodeTimelineCollectorsAuxService}} can take this responsibility, so I don't see any problem with it on the NM side, right? I can think of a small modification on top of your idea:
* Once the NM notifies the auxiliary service that the app is finished (through the container-finished call, in the existing way), {{PerNodeTimelineCollectorsAuxService}} can move this collector to a zombie collector map.
* This map stores the last event published time for each zombie collector.
* We can have one thread running to check which zombie collectors have been inactive for a configurable time period and then remove them.

Thus none of the events are lost till the end. For example, we can keep this period as 2 mins, and if a collector in the zombie map has not been active for 2 mins, then remove it and close it?

> Some of the NM events are not getting published due race condition when AM container finishes in NM
> ----------------------------------------------------------------------------------------------------
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, timelineserver
> Affects Versions: YARN-2928
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045: While testing in TestDistributedShell found out that few of the container metrics events were failing as there will be race condition. When the AM container finishes and removes the collector for the app, still there is possibility that all the events published for the app by the current NM and other NM are still in pipeline,

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
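
To make the zombie-collector proposal in the comment above more concrete, here is a rough sketch of the bookkeeping it describes. This is only an illustration, not code from the YARN-2928 branch: the class {{ZombieCollectorRegistry}} and its methods ({{markZombie}}, {{recordPublish}}, {{sweep}}) are hypothetical names, collectors are represented by their app ID strings only, and the current time is passed in explicitly so the logic can be exercised without waiting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of the "zombie collector map" idea; not part of Hadoop.
 * Tracks finished apps whose collectors are kept alive until they go idle.
 */
final class ZombieCollectorRegistry {
    // appId -> timestamp (ms) of the last event published for that zombie collector
    private final Map<String, Long> lastPublishTimeMs = new ConcurrentHashMap<>();
    // Idle period after which a zombie collector is removed (e.g. 2 minutes)
    private final long idleTimeoutMs;

    ZombieCollectorRegistry(long idleTimeoutMs) {
        this.idleTimeoutMs = idleTimeoutMs;
    }

    /** App finished on this node: park its collector instead of closing it at once. */
    void markZombie(String appId, long nowMs) {
        lastPublishTimeMs.put(appId, nowMs);
    }

    /**
     * A late event arrived for a parked collector: refresh its timestamp.
     * Returns false if the collector was already removed.
     */
    boolean recordPublish(String appId, long nowMs) {
        return lastPublishTimeMs.replace(appId, nowMs) != null;
    }

    /**
     * One pass of the cleanup thread: remove every collector that has been idle
     * for at least idleTimeoutMs and return the removed app IDs so the caller
     * can close the corresponding collectors.
     */
    List<String> sweep(long nowMs) {
        List<String> removed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastPublishTimeMs.entrySet()) {
            Long last = e.getValue();
            // remove(key, value) only succeeds if no publish raced with this sweep
            if (nowMs - last >= idleTimeoutMs && lastPublishTimeMs.remove(e.getKey(), last)) {
                removed.add(e.getKey());
            }
        }
        return removed;
    }

    int size() {
        return lastPublishTimeMs.size();
    }
}
```

In a real auxiliary service, {{sweep}} would presumably be driven by a single background thread (for instance via {{ScheduledExecutorService.scheduleWithFixedDelay}}), with the timeout read from configuration, and each returned app ID would have its collector closed. Those parts are left out here because they depend on details of the collector manager that the comment does not spell out.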