Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id DE5B82009E2 for ; Wed, 1 Jun 2016 18:03:00 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id DD398160A4F; Wed, 1 Jun 2016 16:03:00 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 39264160A45 for ; Wed, 1 Jun 2016 18:03:00 +0200 (CEST) Received: (qmail 55593 invoked by uid 500); 1 Jun 2016 16:02:59 -0000 Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list yarn-issues@hadoop.apache.org Received: (qmail 55575 invoked by uid 99); 1 Jun 2016 16:02:59 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 01 Jun 2016 16:02:59 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 414282C1F5A for ; Wed, 1 Jun 2016 16:02:59 +0000 (UTC) Date: Wed, 1 Jun 2016 16:02:59 +0000 (UTC) From: "Junping Du (JIRA)" To: yarn-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (YARN-5190) Race condition in registering container metrics cause uncaught exception in ContainerMonitorImpl MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 01 Jun 2016 16:03:01 -0000 [ https://issues.apache.org/jira/browse/YARN-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-5190: ----------------------------- Priority: Blocker (was: Critical) > Race condition in registering container metrics cause uncaught exception in ContainerMonitorImpl > ------------------------------------------------------------------------------------------------ > > Key: YARN-5190 > URL: https://issues.apache.org/jira/browse/YARN-5190 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Junping Du > Assignee: Junping Du > Priority: Blocker > > The exception stack is as following: > {noformat} > 310735 2016-05-22 01:50:04,554 [Container Monitor] ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Exception. > 310736 org.apache.hadoop.metrics2.MetricsException: Metrics source ContainerResource_container_1463840817638_14484_01_000010 already exists! > 310737 at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:135) > 310738 at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:112) > 310739 at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > 310740 at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:212) > 310741 at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:198) > 310742 at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:385) > {noformat} > After YARN-4906, we have multiple places to get ContainerMetrics for a particular container that could cause race condition in registering the same container metrics to DefaultMetricsSystem by different threads. Lacking of proper handling of MetricsException which could get thrown, the exception will could bring down daemon of ContainerMonitorImpl or even whole NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: yarn-issues-help@hadoop.apache.org