hadoop-yarn-issues mailing list archives

From "Rohith (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2865) Application recovery continuously fails with "Application with id already present. Cannot duplicate"
Date Sat, 15 Nov 2014 02:25:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213289#comment-14213289
] 

Rohith commented on YARN-2865:
------------------------------

bq. why does the rmContext still contain the application? If the RM were in standby mode,
transitionToStandby should have cleaned the rmContext up?
I agree for the positive flow. But what if transitionToActive throws an exception after recovery
has succeeded? The recovery process adds applications back to the RMContext in RMAppManager. If any
service start failure occurs after recovery has completed, the RMContext is left holding stale applications.
Consider the following scenario:
# RM is in STANDBY state (the initial state).
# STANDBY to ACTIVE:
## Recovery: all applications are recovered successfully. The RMContext now contains the
recovered applications.
## An active service then fails to start and throws an exception back.
   The RM state remains STANDBY. The exception handling here resets only the dispatcher,
not the rmcontext/metrics system. Currently, that reset is done in {{stopActiveService()}}.
# STANDBY to ACTIVE: recovery fails with the above exception. The RM never moves to ACTIVE on
further transitionToActive commands from the elector unless it first gets a STANDBY to STANDBY
command followed by another STANDBY to ACTIVE.
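
The scenario above can be sketched with a small toy model. This is only an illustration under stated assumptions, not the actual ResourceManager code: the class StaleContextSketch, its fields, and the flag clearContextOnFailure are all hypothetical names, and the error text is modeled on the issue title.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the failure; none of these names exist in Hadoop.
class StaleContextSketch {
    // Stand-in for the applications held by the RMContext.
    final Map<String, String> contextApps = new HashMap<>();
    // Stand-in for the application ids kept in the state store.
    final List<String> storedAppIds = new ArrayList<>();
    String haState = "STANDBY";
    String lastError = null;
    // Simulates an active service that fails to start after recovery.
    boolean failServiceStart = false;
    // Assumed fix: also clear the context when transitionToActive fails.
    boolean clearContextOnFailure = false;

    // Recovery adds every stored application back to the context.
    void recover() {
        for (String id : storedAppIds) {
            if (contextApps.containsKey(id)) {
                throw new IllegalStateException(
                    "Application with id " + id + " already present. Cannot duplicate");
            }
            contextApps.put(id, "RECOVERED");
        }
    }

    // Returns true only if the RM reaches ACTIVE.
    boolean transitionToActive() {
        try {
            recover();
            if (failServiceStart) {
                throw new IllegalStateException("active service failed to start");
            }
            haState = "ACTIVE";
            return true;
        } catch (IllegalStateException e) {
            // Models the current handling: the state stays STANDBY, but the
            // context keeps its recovered applications unless the flag is set.
            if (clearContextOnFailure) {
                contextApps.clear();
            }
            lastError = e.getMessage();
            return false;
        }
    }

    // STANDBY to STANDBY (or ACTIVE to STANDBY) clears the context.
    void transitionToStandby() {
        contextApps.clear();
        haState = "STANDBY";
    }

    public static void main(String[] args) {
        StaleContextSketch rm = new StaleContextSketch();
        rm.storedAppIds.add("application_0001");
        rm.failServiceStart = true;
        rm.transitionToActive();      // recovery succeeds, service start fails
        rm.failServiceStart = false;
        rm.transitionToActive();      // fails again on the stale application
        System.out.println(rm.haState + " / " + rm.lastError);
        rm.transitionToStandby();     // clears the stale context
        rm.transitionToActive();
        System.out.println(rm.haState);
    }
}
```

With clearContextOnFailure set to true in this sketch, the second transitionToActive succeeds without the intermediate transition to STANDBY.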

      

> Application recovery continuously fails with "Application with id already present. Cannot duplicate"
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2865
>                 URL: https://issues.apache.org/jira/browse/YARN-2865
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Critical
>         Attachments: YARN-2865.patch
>
>
> YARN-2588 handles the exception thrown while transitioningToActive and resets activeServices.
> But it misses clearing the RMContext apps/nodes details, ClusterMetrics and QueueMetrics.
> This causes application recovery to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
