hadoop-mapreduce-issues mailing list archives

From "Ravi Prakash (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes
Date Wed, 03 Apr 2013 19:07:15 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621199#comment-13621199 ]

Ravi Prakash commented on MAPREDUCE-3949:
-----------------------------------------

The race seems to be between receiving an RMAppAttemptContainerFinishedEvent in RMContainerImpl.java's
FinishedTransition and a FinishApplicationMasterRequest in ApplicationMasterService. Any preferences
on how to fix it? A couple of options come to mind:
1. Make the AM not send the FinishApplicationMasterRequest when it detects (if it can) that
the NM is killing it.
2. Have the NM contact the RM before killing an AM container so that when the AM does send
the FinishApplicationMasterRequest, the RM knows to ignore it.
3. Make the RMAppAttemptEventType.CONTAINER_FINISHED change the state of the AppAttempt even
after FinishApplicationMasterRequest has changed the state to FINISHING / KILLED.

What do you think?
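
To make option 3 concrete, here is a minimal sketch of the idea, not the actual RM code. The class AttemptStateSketch, its State and Event enums, and the handle() signature are all hypothetical names invented for illustration; the real state machine lives in the RM's app attempt implementation and is considerably more involved. The sketch assumes a nonzero exit status marks a container killed by the NM, and that a clean exit without a prior unregister counts as a failure.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of an app attempt; not the real RM class. */
final class AttemptStateSketch {
    enum State { RUNNING, FINISHING, FINISHED, FAILED }
    enum Event { UNREGISTER, CONTAINER_FINISHED }

    private State state = State.RUNNING;
    private final List<String> diagnostics = new ArrayList<>();

    /** Applies one event; exitStatus and message only matter for CONTAINER_FINISHED. */
    synchronized void handle(Event event, int exitStatus, String message) {
        switch (event) {
            case UNREGISTER:
                // The AM asked to finish. Only honor this while still RUNNING,
                // so a late unregister cannot mask an earlier failure.
                if (state == State.RUNNING) {
                    state = State.FINISHING;
                }
                break;
            case CONTAINER_FINISHED:
                if (state == State.RUNNING || state == State.FINISHING) {
                    if (exitStatus != 0) {
                        // Option 3: a failed AM container overrides FINISHING
                        // and its diagnostics are kept for the UI.
                        diagnostics.add(message);
                        state = State.FAILED;
                    } else if (state == State.FINISHING) {
                        state = State.FINISHED;
                    } else {
                        // Assumption: clean exit without unregistering is a failure.
                        diagnostics.add("AM container exited without unregistering");
                        state = State.FAILED;
                    }
                }
                break;
        }
    }

    synchronized State state() {
        return state;
    }

    synchronized List<String> diagnostics() {
        return new ArrayList<>(diagnostics);
    }
}
```

With this shape, the attempt ends up FAILED with the NM's message attached regardless of which of the two events the RM processes first, which is the property the race currently breaks.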
                
> If AM fails due to overrunning resource limits, error not visible through UI sometimes
> --------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3949
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3949
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.24.0, 0.23.2
>            Reporter: Todd Lipcon
>            Assignee: Ravi Prakash
>            Priority: Minor
>
> I had a case where an MR AM exceeded the configured memory limit. This caused the AM's
> container to get killed, but these diagnostics were not shown anywhere accessible through the web UI.
> I had to view the NM's logs via ssh before I could figure out what had happened to my application.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
