hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-5190) Registering/unregistering container metrics triggered by ContainerEvent and ContainersMonitorEvent are conflict which cause uncaught exception in ContainerMonitorImpl
Date Thu, 02 Jun 2016 16:31:59 GMT

    [ https://issues.apache.org/jira/browse/YARN-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311496#comment-15311496
] 

Junping Du edited comment on YARN-5190 at 6/2/16 4:31 PM:
----------------------------------------------------------

Discussed offline with [~jianhe], we think a couple of things need to get fixed here :
1. Fix the asymmetric behaviors in register()/unregisterSource() at MetricsSystemImpl that
source name is still left in {{sourceNames.map}} in DefaultMetricsSystem after unregisterSource().

2. ContainerMetrics.finished() could get called twice - one for container life cycle (involved
in YARN-4906) and the other in container monitoring life cycle. Ideally, it is better to make
sure ContainerMetrics.finished() for the same container only get called one time one place.
However, in practice, the container event life cycle and container monitor event life cycle
are independent and cannot replace each other. Alternatively, we will make sure scheduleTimerTaskForUnregistration()
only get called one time or it will be more threads of unregistration than needed.

3. In case one ContainerMetrics already get finished before (triggered as ContainerDoneTransition
by ContainerKillEvent, ContianerDoneEvent, etc.), current logic in {{ContainerMonitorImpl.updateContainerMetrics(ContainersMonitorEvent)}}
will still register metrics into DefaultMetricsSystem first (via ContainerMetrics.forContainer(...))
and unregister it from DefaultMetricsSystem soon after. This is completely unnecessary.

Will deliver a fix for three issues raised above.


was (Author: djp):
Discussed offline with [~jianhe], we think a couple of things need to get fixed here :
1. Fix the asymmetric behaviors in register()/unregisterSource() at MetricsSystemImpl that
source name is still left in {{sourceNames.map}} in DefaultMetricsSystem after unregisterSource().

2. ContainerMetrics.finished() could get called twice - one for container life cycle (involved
in YARN-4906) and the other in container monitoring life cycle. Ideally, it is better to make
sure ContainerMetrics.finished() for the same container only get called one time one place.
However, in practice, the container event life cycle and container monitor event life cycle
are independent and cannot replace each other. Alternatively, we will make sure scheduleTimerTaskForUnregistration()
only get called one time or it will be more threads of unregistration than needed.

3. In case one ContainerMetrics already get finished before (triggered by Container life cycle
event), current logic in {{ContainerMonitorImpl.updateContainerMetrics(ContainersMonitorEvent)}}
will still register metrics into DefaultMetricsSystem first (via ContainerMetrics.forContainer(...))
and unregister it from DefaultMetricsSystem soon after. This is completely unnecessary.

Will deliver a fix for three issues raised above.

> Registering/unregistering container metrics triggered by ContainerEvent and ContainersMonitorEvent
are conflict which cause uncaught exception in ContainerMonitorImpl
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-5190
>                 URL: https://issues.apache.org/jira/browse/YARN-5190
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Blocker
>         Attachments: YARN-5190.patch
>
>
> The exception stack is as following:
> {noformat}
> 310735 2016-05-22 01:50:04,554 [Container Monitor] ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:
Thread Thread[Container Monitor,5,main] threw an Exception.
> 310736 org.apache.hadoop.metrics2.MetricsException: Metrics source ContainerResource_container_1463840817638_14484_01_000010
already exists!
> 310737         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:135)
> 310738         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:112)
> 310739         at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
> 310740         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:212)
> 310741         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:198)
> 310742         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:385)
> {noformat}
> After YARN-4906, we have multiple places to get ContainerMetrics for a particular container
that could cause race condition in registering the same container metrics to DefaultMetricsSystem
by different threads. Lacking of proper handling of MetricsException which could get thrown,
the exception will could bring down daemon of ContainerMonitorImpl or even whole NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message