hadoop-yarn-issues mailing list archives

From "Ravi Prakash (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-560) If AM fails due to overrunning resource limits, error not visible through UI sometimes
Date Tue, 07 May 2013 17:19:16 GMT

    [ https://issues.apache.org/jira/browse/YARN-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651078#comment-13651078 ]

Ravi Prakash commented on YARN-560:

* In the usual case of NM killing AM while it is running and before AM unregisters with RM,
the error is visible on UI. Right?
Right. If the NM sends SIGTERM and then SIGKILL to the AM before the AM is able to
send an unregister to the RM, the diagnostic message is visible.

* If the AM successfully unregistered with RM, but got killed before it could cleanly exit,
that's when the error is not visible on the UI? And is the only case addressed in the patch?
When the AM gets a SIGTERM, it begins shutdown. If it is able to send an unregister message
to the RM, it puts the RMAppAttemptImpl on the FINISHING->FINISHED path. Irrespective
of whether the AM got enough time to exit cleanly or not, the CONTAINER_FINISHED event from
the NM containing the diagnostic message will not affect the RMAppAttemptImpl at that point.
Here's the state diagram for the RMAppAttemptImpl for reference.
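
To make the race concrete, here is a minimal sketch of the ordering problem described above. It is not the real RMAppAttemptImpl: the class, the four states, the handler names and the diagnostic string are all simplified assumptions for illustration, and the real state machine has many more states and transitions.

```java
public class AttemptRaceSketch {
    // Hypothetical, reduced set of states; the real RMAppAttemptImpl has more.
    enum State { RUNNING, FINISHING, FINISHED, FAILED }

    static class Attempt {
        State state = State.RUNNING;
        String diagnostics = "";

        // The AM managed to send its unregister request to the RM.
        void onUnregister() {
            if (state == State.RUNNING) {
                state = State.FINISHING;
            }
        }

        // The NM reported CONTAINER_FINISHED for the AM container.
        void onContainerFinished(String nmDiagnostics) {
            switch (state) {
                case RUNNING:
                    // No unregister was seen: record the NM diagnostics.
                    state = State.FAILED;
                    diagnostics = nmDiagnostics;
                    break;
                case FINISHING:
                    // Already on the FINISHING->FINISHED path:
                    // the NM diagnostics are dropped.
                    state = State.FINISHED;
                    break;
                default:
                    break;
            }
        }
    }

    // Replays the two orderings discussed in this comment.
    static Attempt replay(boolean unregisterFirst) {
        Attempt attempt = new Attempt();
        if (unregisterFirst) {
            attempt.onUnregister();
        }
        // Hypothetical diagnostic text, not an actual NM message.
        attempt.onContainerFinished("Container killed: memory limit exceeded");
        return attempt;
    }

    public static void main(String[] args) {
        for (boolean unregisterFirst : new boolean[] {false, true}) {
            Attempt a = replay(unregisterFirst);
            System.out.println("unregisterFirst=" + unregisterFirst
                    + " state=" + a.state
                    + " diagnostics='" + a.diagnostics + "'");
        }
    }
}
```

In this sketch, the ordering in which the unregister arrives first ends in FINISHED with empty diagnostics, which corresponds to the case where the error is not visible on the UI.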

We didn't like this approach much either. Could you please think over how to fix the race?
> If AM fails due to overrunning resource limits, error not visible through UI sometimes
> --------------------------------------------------------------------------------------
>                 Key: YARN-560
>                 URL: https://issues.apache.org/jira/browse/YARN-560
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.0-alpha
>            Reporter: Todd Lipcon
>            Assignee: Ravi Prakash
>            Priority: Minor
>              Labels: usability
>         Attachments: MAPREDUCE-3949.patch, RMAppAttemptImplsmall.png, YARN-560.patch,
> I had a case where an MR AM eclipsed the configured memory limit. This caused the AM's
> container to get killed, but nowhere accessible through the web UI showed these diagnostics.
> I had to go view the NM's logs via ssh before I could figure out what had happened to my application.

