hadoop-yarn-issues mailing list archives

From "Szilard Nemeth (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
Date Thu, 02 Aug 2018 15:27:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566890#comment-16566890 ]

Szilard Nemeth edited comment on YARN-4946 at 8/2/18 3:26 PM:
--------------------------------------------------------------

DEV NOTES: 
An initial implementation could have looked like this:
The very first line of the transition checks whether log aggregation is finished.

If it is not, do nothing and return from the method.

To make sure apps still become completed once log aggregation is finished, the APP_COMPLETED event
needs to be dispatched when log aggregation finishes.
In my understanding, this is the sequence of events:
1. The RM receives an NM heartbeat in ResourceTrackerService.nodeUpdate
2. An RMNodeEvent is created with type STATUS_UPDATE
3. RMNodeImpl.StatusUpdateWhenHealthyTransition.transition handles the node status update
4. If there are any log aggregation reports, {{RMNodeImpl#handleLogAggregationStatus}} is called
5. This ultimately calls rmApp.aggregateLogReport

In rmApp.aggregateLogReport, I needed to check whether log aggregation had finished and, if so,
send the APP_COMPLETED event.
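
The heartbeat path above can be sketched as a toy model. This only illustrates the initial idea and is not the actual patch: HeartbeatSketch, ToyNode, ToyApp and LogReport are hypothetical stand-ins for the RM classes, and a plain list of strings stands in for the event dispatcher.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the heartbeat path; every class here is a hypothetical
// stand-in, not the real ResourceManager code.
final class HeartbeatSketch {

  /** One log aggregation report carried by a node heartbeat. */
  static final class LogReport {
    final String appId;
    final boolean finished; // true once aggregation reached a terminal state

    LogReport(String appId, boolean finished) {
      this.appId = appId;
      this.finished = finished;
    }
  }

  /** Stand-in for the app object; records events instead of dispatching them. */
  static final class ToyApp {
    final List<String> events;

    ToyApp(List<String> events) {
      this.events = events;
    }

    // Step 5: the report reaches the app, which sends APP_COMPLETED
    // once log aggregation has finished.
    void aggregateLogReport(LogReport report) {
      if (report.finished) {
        events.add("APP_COMPLETED:" + report.appId);
      }
    }
  }

  /** Stand-in for the node side of the RM (steps 1 to 4). */
  static final class ToyNode {
    final Map<String, ToyApp> apps = new HashMap<>();

    // Steps 1-3: a heartbeat becomes a status update handled by the node.
    void statusUpdate(List<LogReport> reports) {
      if (!reports.isEmpty()) {
        handleLogAggregationStatus(reports); // step 4
      }
    }

    void handleLogAggregationStatus(List<LogReport> reports) {
      for (LogReport report : reports) {
        ToyApp app = apps.get(report.appId);
        if (app != null) {
          app.aggregateLogReport(report);
        }
      }
    }
  }

  /** Sends two heartbeats for one app and returns the events that were recorded. */
  static List<String> demo() {
    List<String> events = new ArrayList<>();
    ToyNode node = new ToyNode();
    node.apps.put("app_1", new ToyApp(events));
    // First heartbeat: aggregation still running, nothing is sent.
    node.statusUpdate(Collections.singletonList(new LogReport("app_1", false)));
    // Second heartbeat: aggregation finished, APP_COMPLETED is recorded.
    node.statusUpdate(Collections.singletonList(new LogReport("app_1", true)));
    return events;
  }
}
```

Calling HeartbeatSketch.demo() records a single APP_COMPLETED entry, produced by the second heartbeat only.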

An issue with this approach:
If a {{FinalTransition}} runs because the app was killed, finished or rejected, e.g. RMAppImpl
goes from the RUNNING to the FINISHED state (RMAppEventType.ATTEMPT_FINISHED), then no matter what
happens in {{FinalTransition}}, the app reaches a terminal state (FINISHED in this case).
If I returned early as described above, the app would still end up in the FINISHED state,
which is not right, because the rest of the code in the transition would never get another chance to run.
So with my implementation, all the code in {{FinalTransition}} runs as before, and if
log aggregation is not finished yet, I don't send the APP_COMPLETED event to the {{RMAppManager}}.
When log aggregation finishes for an app, {{RMAppImpl#aggregateLogReport}} is called.
In this method, I added a piece of code that sends the APP_COMPLETED event to the {{RMAppManager}}
if the application is in a final state.
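
A minimal sketch of the deferred notification described above, under simplifying assumptions: DeferredCompletionSketch, ToyApp and AppState are hypothetical names, a boolean stands in for the log aggregation status, and a list of strings stands in for the events the {{RMAppManager}} would receive. The duplicate guard in sendAppCompleted is an assumption of the sketch, not something the comment describes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of deferring APP_COMPLETED until log aggregation is done.
// Hypothetical stand-ins only; this is not the code from the patch.
final class DeferredCompletionSketch {

  enum AppState { RUNNING, FINISHED }

  static final class ToyApp {
    final String id;
    final List<String> managerEvents; // what the app manager would receive
    AppState state = AppState.RUNNING;
    boolean logAggregationFinished = false;
    boolean cleanupDone = false;
    private boolean completedSent = false;

    ToyApp(String id, List<String> managerEvents) {
      this.id = id;
      this.managerEvents = managerEvents;
    }

    // Models the final transition: the whole body always runs and the app
    // always reaches a terminal state; only the notification is conditional.
    void finalTransition() {
      cleanupDone = true;
      state = AppState.FINISHED;
      if (logAggregationFinished) {
        sendAppCompleted();
      }
    }

    // Models the log report handler: once aggregation has finished and the
    // app is already in a final state, send the deferred notification.
    void aggregateLogReport(boolean finished) {
      logAggregationFinished = finished;
      if (finished && state == AppState.FINISHED) {
        sendAppCompleted();
      }
    }

    private void sendAppCompleted() {
      if (!completedSent) { // guard against duplicates (sketch assumption)
        completedSent = true;
        managerEvents.add("APP_COMPLETED:" + id);
      }
    }
  }

  /** The app finishes first; the notification arrives with the later log report. */
  static List<String> appFinishesFirst() {
    List<String> events = new ArrayList<>();
    ToyApp app = new ToyApp("app_1", events);
    app.finalTransition();        // cleanup runs, no event yet
    app.aggregateLogReport(true); // deferred APP_COMPLETED is sent here
    return events;
  }
}
```

In the opposite order, where the log report arrives while the app is still RUNNING, the same model sends APP_COMPLETED from finalTransition instead, so the event is sent exactly once in either case.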




> RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4946
>                 URL: https://issues.apache.org/jira/browse/YARN-4946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: log-aggregation
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-4946.001.patch, YARN-4946.002.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each Yarn App
> into a HAR file.  When run, it seeds the list by looking at the aggregated logs directory,
> and then filters out ineligible apps.  One of the criteria involves checking with the RM that
> an Application's log aggregation status is not still running and has not failed.  When the
> RM "forgets" about an older completed Application (e.g. RM failover, enough time has passed,
> etc), the tool won't find the Application in the RM and will just assume that its log aggregation
> succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed from its history)
> until the aggregation status has reached a terminal state (e.g. SUCCEEDED, FAILED, TIME_OUT).
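
The rule in the description can be sketched as a small predicate. This is an illustration under assumptions: the Status enum below is a local stand-in modeled on YARN's LogAggregationStatus values, LogAggregationTerminalSketch and canForget are hypothetical names, and only the three states named in the description are treated as terminal. Whether DISABLED should also count is left open here.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper; not part of the YARN API.
final class LogAggregationTerminalSketch {

  // Local stand-in modeled on the values of YARN's LogAggregationStatus.
  enum Status {
    DISABLED, NOT_START, RUNNING, RUNNING_WITH_FAILURE, SUCCEEDED, FAILED, TIME_OUT
  }

  // Only the states named in the issue description are treated as terminal.
  static final Set<Status> TERMINAL =
      EnumSet.of(Status.SUCCEEDED, Status.FAILED, Status.TIME_OUT);

  static boolean isTerminal(Status status) {
    return TERMINAL.contains(status);
  }

  // An app may be dropped from the RM's history only when it is in a final
  // state and its log aggregation has reached a terminal state.
  static boolean canForget(boolean appInFinalState, Status status) {
    return appInFinalState && isTerminal(status);
  }
}
```

With this predicate, a finished app whose aggregation is still RUNNING or RUNNING_WITH_FAILURE stays visible to the RM, which is the behavior the HAR tool relies on.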



