hadoop-mapreduce-dev mailing list archives

From "Robert Kanter (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-5641) History for failed Application Masters should be made available to the Job History Server
Date Thu, 21 Nov 2013 22:37:35 GMT
Robert Kanter created MAPREDUCE-5641:

             Summary: History for failed Application Masters should be made available to the Job History Server
                 Key: MAPREDUCE-5641
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5641
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: applicationmaster, jobhistoryserver
    Affects Versions: 2.2.0
            Reporter: Robert Kanter
            Assignee: Robert Kanter

Currently, the JHS has no information about jobs whose AMs have failed.  This is because the
AM writes the History to the intermediate folder just before finishing, so when the AM
fails for any reason, this information isn't copied there.  However, it is not lost, as it's
in the AM's staging directory.  To make the History available in the JHS, all we need
is another mechanism to move the History from the staging directory to the intermediate
directory.  The AM also writes a "Summary" file before exiting normally, which is likewise
unavailable when the AM fails.

I propose we solve this issue by doing the following:
The Resource Manager (RM) knows when an AM fails, so at that point it can write a flag
file to a new "fail" directory.  The JHS periodically scans the "fail" directory for these flag
files.  When it sees one, it looks for the History for that failed AM; if found, it copies/moves
the History to the intermediate directory, where it will be processed by the JHS normally.
If not found, it does nothing.  Once done, the JHS can then delete the flag file.
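
The scan described above could look roughly like the following. This is only a sketch under stated assumptions, not a proposed patch: it uses java.nio on a local filesystem as a stand-in for HDFS, and the class name, the convention that each flag file is named after the job ID, and the staging layout <stagingDir>/<jobId>/<jobId>.jhist are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Sketch only: the directory layout and file names here are assumptions.
class FailedAmHistoryScanner {

    /**
     * Runs one pass over the "fail" directory.
     * Assumes each flag file is named after the job ID of the failed AM and that
     * the History for that job is at stagingDir/<jobId>/<jobId>.jhist.
     * Returns the number of History files moved to the intermediate directory.
     */
    static int scanOnce(Path failDir, Path stagingDir, Path intermediateDir)
            throws IOException {
        // Snapshot the flag files first so deletions do not affect the iteration.
        List<Path> flags = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(failDir)) {
            for (Path flag : stream) {
                flags.add(flag);
            }
        }

        int moved = 0;
        for (Path flag : flags) {
            String jobId = flag.getFileName().toString();
            Path history = stagingDir.resolve(jobId).resolve(jobId + ".jhist");
            if (Files.isRegularFile(history)) {
                // History found: hand it to the normal JHS pipeline.
                Files.createDirectories(intermediateDir);
                Files.move(history,
                        intermediateDir.resolve(history.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                moved++;
            }
            // Found or not, this flag has been handled, so delete it.
            Files.delete(flag);
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: <failDir> <stagingDir> <intermediateDir>");
            return;
        }
        int moved = scanOnce(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        System.out.println("History files moved: " + moved);
    }
}
```

In the JHS this pass would be run on a timer.  The sketch deletes the flag file even when no History is found, which is one reading of "Once done"; keeping the flag for a later retry would be an alternative.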
For the Summary file, most of it is static, so we can simply have the AM write that file out
at startup (with 0 or "N/A" for dynamic fields) and then overwrite it at shutdown to get the
values for the dynamic fields as it does now.  If the AM fails, then the JHS will at least
be able to pick up the first version of the Summary file.
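
The two-phase Summary write could be sketched as below.  Again this is illustrative only: the class name, the field names (jobId, user, status, finishTime), and the key=value format are hypothetical and not the actual Summary file format, and java.nio stands in for HDFS.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: field names and file format are assumptions.
class SummaryFileSketch {

    // Renders one summary line; jobId and user are static, the rest dynamic.
    static String render(String jobId, String user, String status, long finishTime) {
        return "jobId=" + jobId + ",user=" + user
                + ",status=" + status + ",finishTime=" + finishTime;
    }

    // At AM startup: static fields are known, dynamic ones get placeholders.
    static void writeAtStartup(Path summary, String jobId, String user)
            throws IOException {
        write(summary, render(jobId, user, "N/A", 0L));
    }

    // At normal AM shutdown: overwrite the file with the real dynamic values.
    static void overwriteAtShutdown(Path summary, String jobId, String user,
                                    String status, long finishTime) throws IOException {
        write(summary, render(jobId, user, status, finishTime));
    }

    private static void write(Path summary, String line) throws IOException {
        // Files.write defaults to CREATE + TRUNCATE_EXISTING, so a second
        // call replaces the startup version.
        Files.write(summary, line.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path summary = Files.createTempFile("job-summary", ".txt");
        writeAtStartup(summary, "job_1", "alice");
        System.out.println(new String(Files.readAllBytes(summary), StandardCharsets.UTF_8));
        overwriteAtShutdown(summary, "job_1", "alice", "SUCCEEDED", 1385073455000L);
        System.out.println(new String(Files.readAllBytes(summary), StandardCharsets.UTF_8));
    }
}
```

If the AM dies between the two calls, the file on disk is the startup version with the placeholder values, which is what the JHS would then pick up.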

This message was sent by Atlassian JIRA
