hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-4325) Purge app state from NM state-store should cover more LOG_HANDLING cases
Date Tue, 10 May 2016 19:52:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278751#comment-15278751
] 

Jason Lowe edited comment on YARN-4325 at 5/10/16 7:51 PM:
-----------------------------------------------------------

I'm just thinking the explicit boolean check and special-case logic is a bit ugly compared
to the typical flow.  If we simply changed the log handlers so they don't ignore events and
always send a response then I don't think we need the special-case tracking.  For example,
if the log handlers receive an event for an app they are no longer tracking (because the app
log handling failed to init or whatever) then they immediately send back the APPLICATION_LOG_HANDLING_FAILED
or APPLICATION_LOG_HANDLING_FINISHED event.  Then we can have the app state machine always
clean up in the final finished state as normal rather than having special-case removal logic
in other states.
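
A minimal sketch of the "always send a response" idea, in Java.  This is not the
actual NodeManager code: SketchLogHandler, LogResponse, startTracking() and
handleAppFinished() are hypothetical names made up for illustration, and answering an
untracked app with APPLICATION_LOG_HANDLING_FAILED (rather than FINISHED) is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the two responses named in the comment.
enum LogResponse { APPLICATION_LOG_HANDLING_FINISHED, APPLICATION_LOG_HANDLING_FAILED }

// Hypothetical, simplified log handler; not the real NM log handler classes.
class SketchLogHandler {
    // Apps whose log handling was initialized successfully.
    private final Set<String> trackedApps = new HashSet<>();
    // Responses "sent back" to the app state machine, recorded here for inspection.
    final List<String> sent = new ArrayList<>();

    void startTracking(String appId) {
        trackedApps.add(appId);
    }

    // Called when the application has finished and asks for log handling to wrap up.
    void handleAppFinished(String appId) {
        if (!trackedApps.remove(appId)) {
            // Not tracked (for example, init failed earlier): respond anyway
            // instead of silently ignoring the event. FAILED is an assumption.
            respond(appId, LogResponse.APPLICATION_LOG_HANDLING_FAILED);
            return;
        }
        respond(appId, LogResponse.APPLICATION_LOG_HANDLING_FINISHED);
    }

    private void respond(String appId, LogResponse response) {
        sent.add(appId + ":" + response);
    }

    public static void main(String[] args) {
        SketchLogHandler handler = new SketchLogHandler();
        handler.startTracking("app_1");
        handler.handleAppFinished("app_1"); // tracked app
        handler.handleAppFinished("app_2"); // never tracked, still gets a response
        System.out.println(handler.sent);
    }
}
```

Because every request gets exactly one response, the app side would not need a
separate boolean to remember whether a response is still expected.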


was (Author: jlowe):
I'm just thinking the explicit boolean check and special-case logic is a bit ugly compared
to the typical flow.  If we simply changed the log handlers so they dont ignore events and
always send a response then I don't think we need the special tracking. gets an event.  For
example, if the log handlers receive an event for an app they are no longer tracking (because
the app log handling failed to init or whatever) then it immediately sends back the APPLICATION_LOG_HANDLING_FAILED
or APPLICATION_LOG_HANDLING_FINISHED event.  Then we can have the app state machine always
clean up in the final finished state as normal rather than having special-case removal logic
in other states.

> Purge app state from NM state-store should cover more LOG_HANDLING cases
> ------------------------------------------------------------------------
>
>                 Key: YARN-4325
>                 URL: https://issues.apache.org/jira/browse/YARN-4325
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: ApplicationImpl.PNG, YARN-4325-v1.1.patch, YARN-4325-v1.patch, YARN-4325.patch
>
>
> From a long-running cluster, we found tens of thousands of stale apps still being recovered
> during NM restart recovery.
> After investigating, we found three issues that cause app state to leak in the NM state-store:
> 1. APPLICATION_LOG_HANDLING_FAILED is not handled by removing the app from the NMStateStore.
> 2. The APPLICATION_LOG_HANDLING_FAILED event is not sent when the aggregator's doAppLogAggregation()
> hits an exception.
> 3. Only an application in the FINISHED state that receives APPLICATION_LOG_HANDLING_FINISHED has a
> transition that removes the app from the NM state store. An application in another state, such as
> APPLICATION_RESOURCES_CLEANUP, ignores the event and is never removed from the NM state store,
> even after the app finishes.
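
The cleanup behaviour described in the issue can be illustrated with a small Java sketch.
It is not the real ApplicationImpl state machine: SketchApplication, SketchAppState,
SketchAppEvent and the Set standing in for the NM state store are all hypothetical, and
the state list is reduced to two states for brevity.  It shows an app that removes itself
from the store in the FINISHED state for both log-handling outcomes, and that ignores the
same events in any other state.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, reduced state and event sets for illustration only.
enum SketchAppState { RUNNING, FINISHED }
enum SketchAppEvent { APPLICATION_LOG_HANDLING_FINISHED, APPLICATION_LOG_HANDLING_FAILED }

class SketchApplication {
    private final String appId;
    private final Set<String> stateStore; // stands in for the NM state store
    private SketchAppState state = SketchAppState.RUNNING;

    SketchApplication(String appId, Set<String> stateStore) {
        this.appId = appId;
        this.stateStore = stateStore;
        stateStore.add(appId); // app state is persisted when the app starts
    }

    void finish() {
        state = SketchAppState.FINISHED;
    }

    void handle(SketchAppEvent event) {
        if (state != SketchAppState.FINISHED) {
            // Events arriving in other states are ignored, as the issue describes.
            return;
        }
        switch (event) {
            case APPLICATION_LOG_HANDLING_FINISHED:
            case APPLICATION_LOG_HANDLING_FAILED:
                // Both outcomes purge the app, so a failure does not leak state.
                stateStore.remove(appId);
                break;
        }
    }

    public static void main(String[] args) {
        Set<String> store = new HashSet<>();
        SketchApplication app = new SketchApplication("app_1", store);
        app.handle(SketchAppEvent.APPLICATION_LOG_HANDLING_FAILED); // ignored while RUNNING
        System.out.println("before finish: " + store);
        app.finish();
        app.handle(SketchAppEvent.APPLICATION_LOG_HANDLING_FAILED); // purged in FINISHED
        System.out.println("after finish: " + store);
    }
}
```

In this sketch an event that arrives before FINISHED is simply lost, which is the leak in
case 3; it only goes away if a response is also delivered once the app is in FINISHED.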



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

