hadoop-yarn-issues mailing list archives

From "Rohith Sharma K S (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3999) RM hangs on draing events
Date Tue, 11 Aug 2015 11:58:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681691#comment-14681691 ]

Rohith Sharma K S commented on YARN-3999:
-----------------------------------------

Thanks [~jianhe] for updating the patch.
One doubt: SystemMetricsPublisher has been moved from RMActiveServices to ResourceManager,
so this service will not be reinitialized on every RM switch. I think this could lead to
stale events being processed even after the RM is in standby. If the same RM becomes active
again, the SystemMetricsPublisher dispatcher publishes the stale events plus the recovered
application events. In that case the events are still processed in sequential order. But an
issue may occur when a different RM becomes active, i.e.
# RM1 is active and publishing events.
# RM1 transitions to standby while some events are still in the queue, waiting to be written
to the timeline server.
# RM2 becomes active and recovers the applications. When an application finishes, RM2's
SystemMetricsPublisher publishes the app status as finished.
# RM1 is still processing events for that app, so they are published late, i.e. after RM2's
events have already been processed.

Wouldn't this cause a problem? Any thoughts?
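
To make the ordering concern in the four steps concrete, here is a minimal sketch. It is not YARN code: `TimelineStore`, `replay`, and the app id `app_1` are hypothetical names, and the sketch assumes the store simply keeps the last state written for an application, which may not match how the real timeline server records events.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class StaleEventOrdering {
    /** Hypothetical stand-in for the timeline store: the last write per app wins. */
    static final class TimelineStore {
        private final Map<String, String> appState = new HashMap<>();

        void put(String appId, String state) {
            appState.put(appId, state);
        }

        String get(String appId) {
            return appState.get(appId);
        }
    }

    /** Replays the four steps and returns the state the store ends up holding. */
    public static String replay() {
        TimelineStore store = new TimelineStore();
        Queue<String> rm1Pending = new ArrayDeque<>();

        // Steps 1-2: RM1 has queued an event, then goes to standby before publishing it.
        rm1Pending.add("RUNNING");

        // Step 3: RM2 becomes active, the app finishes, and RM2 publishes the final state.
        store.put("app_1", "FINISHED");

        // Step 4: RM1's publisher was never stopped, so it drains its queue late.
        while (!rm1Pending.isEmpty()) {
            store.put("app_1", rm1Pending.poll());
        }
        return store.get("app_1");
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints "RUNNING": the stale event overwrote FINISHED
    }
}
```

Under that last-write-wins assumption, the late event from RM1 hides the finished status published by RM2.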

> RM hangs on draing events
> -------------------------
>
>                 Key: YARN-3999
>                 URL: https://issues.apache.org/jira/browse/YARN-3999
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-3999.1.patch, YARN-3999.2.patch, YARN-3999.2.patch, YARN-3999.3.patch,
>                      YARN-3999.4.patch, YARN-3999.5.patch, YARN-3999.patch, YARN-3999.patch
>
>
> If external systems like ATS or ZK become very slow, draining all the events takes a
> lot of time. If this time exceeds 10 minutes, all applications will expire. Fixes
> include:
> 1. Add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move the ATS service out of the RM active services so that the RM doesn't need to wait
> for ATS to flush the events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients get fast
> notification that the RM is stopping/transitioning.
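
The first fix in the description (a bounded wait on drain) can be sketched roughly as follows. This is an illustration only, not the code in the attached patches: `waitForDrain`, its parameters, and the polling approach are assumptions, and the queue with no consumer stands in for a dispatcher whose handler is blocked on a slow ATS or ZK.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DrainWithTimeout {
    /**
     * Waits until the queue is empty or the timeout elapses.
     * Returns true if the queue drained, false if the caller should
     * stop the dispatcher with events still pending.
     */
    public static boolean waitForDrain(BlockingQueue<?> queue, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!queue.isEmpty()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out: do not block shutdown any longer
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BlockingQueue<String> events = new LinkedBlockingQueue<>();
        events.add("APP_FINISHED");
        // No consumer thread is running, which mimics a stalled external system.
        boolean drained = waitForDrain(events, 200, 20);
        System.out.println("drained=" + drained + ", pending=" + events.size());
        // prints "drained=false, pending=1"
    }
}
```

The trade-off is that any events still pending when the timeout fires are dropped or left unpublished, which is the price of not letting a slow external system hold up the transition.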



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
