Return-Path: X-Original-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6B87018CC6 for ; Mon, 7 Dec 2015 11:13:45 +0000 (UTC) Received: (qmail 855 invoked by uid 500); 7 Dec 2015 11:13:11 -0000 Delivered-To: apmail-hadoop-yarn-issues-archive@hadoop.apache.org Received: (qmail 806 invoked by uid 500); 7 Dec 2015 11:13:11 -0000 Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: yarn-issues@hadoop.apache.org Delivered-To: mailing list yarn-issues@hadoop.apache.org Received: (qmail 745 invoked by uid 99); 7 Dec 2015 11:13:11 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 07 Dec 2015 11:13:11 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 0D45F2C1F5C for ; Mon, 7 Dec 2015 11:13:11 +0000 (UTC) Date: Mon, 7 Dec 2015 11:13:11 +0000 (UTC) From: "Junping Du (JIRA)" To: yarn-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (YARN-4381) Add container launchEvent and container localizeFailed metrics in container MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/YARN-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044770#comment-15044770 ] Junping Du commented on YARN-4381: ---------------------------------- bq. That's NodeManagerMetrics#containersLaunched is not actually means the container succeed launched times. At the beginning, containersLaunched doesn't means container launched with successful. I don't see there is some problem here. For container launch failed metrics, do we need to differentiate it with failed container (including failed after running) metrics? If so, what's the use case for this metrics? In addition, container get launched failed doesn't always means localizeFailed, why we are focus on localization only? Last but not the least, what's different with launchEvent with containersLaunched? We normally not putting metrics on events, as theoretically, all events could be sent duplicated and it should be designed as idempotent in most cases. > Add container launchEvent and container localizeFailed metrics in container > --------------------------------------------------------------------------- > > Key: YARN-4381 > URL: https://issues.apache.org/jira/browse/YARN-4381 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager > Affects Versions: 2.7.1 > Reporter: Lin Yiqun > Assignee: Lin Yiqun > Attachments: YARN-4381.001.patch > > > Recently, I found a issue on nodemanager metrics.That's {{NodeManagerMetrics#containersLaunched}} is not actually means the container succeed launched times.Because in some time, it will be failed when receiving the killing command or happening container-localizationFailed.This will lead to a failed container.But now,this counter value will be increased in these code whenever the container is started successfully or failed. > {code} > Credentials credentials = parseCredentials(launchContext); > Container container = > new ContainerImpl(getConfig(), this.dispatcher, > context.getNMStateStore(), launchContext, > credentials, metrics, containerTokenIdentifier); > ApplicationId applicationID = > containerId.getApplicationAttemptId().getApplicationId(); > if (context.getContainers().putIfAbsent(containerId, container) != null) { > NMAuditLogger.logFailure(user, AuditConstants.START_CONTAINER, > "ContainerManagerImpl", "Container already running on this node!", > applicationID, containerId); > throw RPCUtil.getRemoteException("Container " + containerIdStr > + " already is running on this node!!"); > } > this.readLock.lock(); > try { > if (!serviceStopped) { > // Create the application > Application application = > new ApplicationImpl(dispatcher, user, applicationID, credentials, context); > if (null == context.getApplications().putIfAbsent(applicationID, > application)) { > LOG.info("Creating a new application reference for app " + applicationID); > LogAggregationContext logAggregationContext = > containerTokenIdentifier.getLogAggregationContext(); > Map appAcls = > container.getLaunchContext().getApplicationACLs(); > context.getNMStateStore().storeApplication(applicationID, > buildAppProto(applicationID, user, credentials, appAcls, > logAggregationContext)); > dispatcher.getEventHandler().handle( > new ApplicationInitEvent(applicationID, appAcls, > logAggregationContext)); > } > this.context.getNMStateStore().storeContainer(containerId, request); > dispatcher.getEventHandler().handle( > new ApplicationContainerInitEvent(container)); > this.context.getContainerTokenSecretManager().startContainerSuccessful( > containerTokenIdentifier); > NMAuditLogger.logSuccess(user, AuditConstants.START_CONTAINER, > "ContainerManageImpl", applicationID, containerId); > // TODO launchedContainer misplaced -> doesn't necessarily mean a container > // launch. A finished Application will not launch containers. > metrics.launchedContainer(); > metrics.allocateContainer(containerTokenIdentifier.getResource()); > } else { > throw new YarnException( > "Container start failed as the NodeManager is " + > "in the process of shutting down"); > } > {code} > In addition, we are lack of localzationFailed metric in container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)