hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4381) Add container launchEvent and container localizeFailed metrics in container
Date Mon, 07 Dec 2015 11:13:11 GMT

    [ https://issues.apache.org/jira/browse/YARN-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044770#comment-15044770 ]

Junping Du commented on YARN-4381:
----------------------------------

bq. That is, NodeManagerMetrics#containersLaunched does not actually count the containers
that launched successfully.
From the beginning, containersLaunched has never meant that a container launched successfully,
so I don't see a problem here. For a container-launch-failed metric, do we need to differentiate
it from the failed-container metric (which includes containers that fail after running)? If so,
what is the use case for this metric? In addition, a container failing to launch doesn't always
mean that localization failed, so why are we focusing on localization only? Last but not least,
what is the difference between launchEvent and containersLaunched? We normally don't put metrics
on events because, in theory, any event could be sent more than once, and event handling should
be designed to be idempotent in most cases.
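
To illustrate the concern about duplicated events, here is a minimal sketch. It is not
NodeManager code: LaunchCounter, onLaunchEvent, and the string container IDs are hypothetical
names used only for illustration. It contrasts a counter that is bumped on every event delivery
with one that counts each container at most once.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; not part of the NodeManager code base. */
class LaunchCounter {
  // Naive metric: incremented on every delivery of a launch event.
  private final AtomicLong perEvent = new AtomicLong();
  // Idempotent metric: each container ID is counted at most once.
  private final AtomicLong perContainer = new AtomicLong();
  private final Set<String> seen = ConcurrentHashMap.newKeySet();

  void onLaunchEvent(String containerId) {
    perEvent.incrementAndGet();
    // Set.add returns false when the ID was already recorded,
    // so a redelivered event does not change the second counter.
    if (seen.add(containerId)) {
      perContainer.incrementAndGet();
    }
  }

  long perEventCount() { return perEvent.get(); }

  long perContainerCount() { return perContainer.get(); }

  public static void main(String[] args) {
    LaunchCounter c = new LaunchCounter();
    c.onLaunchEvent("container_1");
    c.onLaunchEvent("container_1"); // duplicate delivery of the same event
    c.onLaunchEvent("container_2");
    // Prints "3 vs 2": the per-event counter overcounts the duplicate.
    System.out.println(c.perEventCount() + " vs " + c.perContainerCount());
  }
}
```

If a metric really has to be driven by events, deduplicating on a stable key such as the
container ID is one way to keep it idempotent; the cost is the extra set of seen IDs.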

> Add container launchEvent and container localizeFailed metrics in container
> ---------------------------------------------------------------------------
>
>                 Key: YARN-4381
>                 URL: https://issues.apache.org/jira/browse/YARN-4381
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.7.1
>            Reporter: Lin Yiqun
>            Assignee: Lin Yiqun
>         Attachments: YARN-4381.001.patch
>
>
> Recently, I found an issue with the NodeManager metrics: {{NodeManagerMetrics#containersLaunched}}
does not actually count the containers that launched successfully. A launch can sometimes fail,
for example when a kill command is received or when container localization fails, which leads
to a failed container. Currently, however, the counter is incremented in the following code
whether the container starts successfully or not.
> {code}
>     Credentials credentials = parseCredentials(launchContext);
>     Container container =
>         new ContainerImpl(getConfig(), this.dispatcher,
>             context.getNMStateStore(), launchContext,
>           credentials, metrics, containerTokenIdentifier);
>     ApplicationId applicationID =
>         containerId.getApplicationAttemptId().getApplicationId();
>     if (context.getContainers().putIfAbsent(containerId, container) != null) {
>       NMAuditLogger.logFailure(user, AuditConstants.START_CONTAINER,
>         "ContainerManagerImpl", "Container already running on this node!",
>         applicationID, containerId);
>       throw RPCUtil.getRemoteException("Container " + containerIdStr
>           + " already is running on this node!!");
>     }
>     this.readLock.lock();
>     try {
>       if (!serviceStopped) {
>         // Create the application
>         Application application =
>             new ApplicationImpl(dispatcher, user, applicationID, credentials, context);
>         if (null == context.getApplications().putIfAbsent(applicationID,
>           application)) {
>           LOG.info("Creating a new application reference for app " + applicationID);
>           LogAggregationContext logAggregationContext =
>               containerTokenIdentifier.getLogAggregationContext();
>           Map<ApplicationAccessType, String> appAcls =
>               container.getLaunchContext().getApplicationACLs();
>           context.getNMStateStore().storeApplication(applicationID,
>               buildAppProto(applicationID, user, credentials, appAcls,
>                 logAggregationContext));
>           dispatcher.getEventHandler().handle(
>             new ApplicationInitEvent(applicationID, appAcls,
>               logAggregationContext));
>         }
>         this.context.getNMStateStore().storeContainer(containerId, request);
>         dispatcher.getEventHandler().handle(
>           new ApplicationContainerInitEvent(container));
>         this.context.getContainerTokenSecretManager().startContainerSuccessful(
>           containerTokenIdentifier);
>         NMAuditLogger.logSuccess(user, AuditConstants.START_CONTAINER,
>           "ContainerManageImpl", applicationID, containerId);
>         // TODO launchedContainer misplaced -> doesn't necessarily mean a container
>         // launch. A finished Application will not launch containers.
>         metrics.launchedContainer();
>         metrics.allocateContainer(containerTokenIdentifier.getResource());
>       } else {
>         throw new YarnException(
>             "Container start failed as the NodeManager is " +
>             "in the process of shutting down");
>       }
> {code}
> In addition, we lack a localizationFailed metric for containers.
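
The distinction requested in the description could be sketched roughly as follows. This is
an assumption-laden illustration, not the attached patch: ContainerMetricsSketch, Outcome, and
the three counter names are hypothetical and are not existing NodeManagerMetrics members. The
idea is to count start requests when they are accepted and to count successful launches and
localization failures only once the outcome is known.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; these names are not existing NodeManagerMetrics members. */
class ContainerMetricsSketch {
  enum Outcome { LAUNCHED, LOCALIZATION_FAILED, KILLED_BEFORE_LAUNCH }

  final AtomicLong startRequests = new AtomicLong();
  final AtomicLong launched = new AtomicLong();
  final AtomicLong localizationFailed = new AtomicLong();

  // Called when a start request is accepted, before the outcome is known.
  void onStartRequest() {
    startRequests.incrementAndGet();
  }

  // Called once per container when its launch outcome is known.
  void onOutcome(Outcome outcome) {
    switch (outcome) {
      case LAUNCHED:
        launched.incrementAndGet();
        break;
      case LOCALIZATION_FAILED:
        localizationFailed.incrementAndGet();
        break;
      default:
        // Killed before launch: neither outcome counter changes.
        break;
    }
  }

  public static void main(String[] args) {
    ContainerMetricsSketch m = new ContainerMetricsSketch();
    for (Outcome o : Outcome.values()) {
      m.onStartRequest();
      m.onOutcome(o);
    }
    // Prints "requests=3 launched=1 localizationFailed=1".
    System.out.println("requests=" + m.startRequests
        + " launched=" + m.launched
        + " localizationFailed=" + m.localizationFailed);
  }
}
```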



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
