hadoop-yarn-issues mailing list archives

From "Vrushali C (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
Date Wed, 23 Dec 2015 20:55:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070185#comment-15070185 ]

Vrushali C commented on YARN-3995:
----------------------------------

Hi [~Naganarasimha]

Thanks for the thoughts on the jira. I was wondering if the following is a feasible solution:

- Can the NM maintain a list/map of “zombie app ids” for the AMs/collectors that it is
removing? That way, when metrics arrive at the NM from other NMs for those zombie app ids,
it can check whether they are for an app that previously had a collector, and are hence most
likely still valid metrics/entities, and then somehow write them to the backend, perhaps via
a “common parent collector” process or something.

- We can have the NM periodically prune this zombie list: say, a few days after app
completion, remove the info for that app from the zombie app list.
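
To make the idea concrete, here is a minimal sketch of such a zombie list in Java. Everything in it is hypothetical: ZombieAppTracker, its method names, and the retention parameter are made up for illustration and are not existing NM or timeline collector code. Timestamps are passed in explicitly so that the pruning logic is easy to exercise.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not an existing YARN class: remembers the app ids whose
// collector was removed on this NM, together with the removal time.
final class ZombieAppTracker {
  // app id -> time (ms) at which the collector for that app was removed
  private final Map<String, Long> removedAtMillis = new ConcurrentHashMap<>();
  private final long retentionMillis;

  ZombieAppTracker(long retentionMillis) {
    this.retentionMillis = retentionMillis;
  }

  // Call when the AM container finishes and the app's collector is removed.
  void markCollectorRemoved(String appId, long nowMillis) {
    removedAtMillis.put(appId, nowMillis);
  }

  // True if a late metric/entity for this app id belongs to an app that
  // previously had a collector here and has not been pruned yet.
  boolean isZombie(String appId) {
    return removedAtMillis.containsKey(appId);
  }

  // Drops entries that are at least retentionMillis old. The returned count
  // is only approximate if other threads modify the map at the same time.
  int prune(long nowMillis) {
    int before = removedAtMillis.size();
    removedAtMillis.values()
        .removeIf(removedAt -> nowMillis - removedAt >= retentionMillis);
    return before - removedAtMillis.size();
  }
}
```

A periodic task in the NM could call prune() with the current time, and the code path that receives late events could consult isZombie() before deciding whether to forward them to the backend. How the forwarding itself would work (the “common parent collector” part) is not covered by this sketch.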

I am not too knowledgeable about the NM, so I am not sure whether this is complicated or infeasible.



> Some of the NM events are not getting published due race condition when AM container finishes in NM
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3995
>                 URL: https://issues.apache.org/jira/browse/YARN-3995
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045: While testing in TestDistributedShell found out that few
> of the container metrics events were failing as there will be race condition. When the AM
> container finishes and removes the collector for the app, still there is possibility that
> all the events published for the app by the current NM and other NM are still in pipeline,




