hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4729) job history UI not showing all job attempts
Date Thu, 01 Nov 2012 15:09:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488741#comment-13488741 ]

Jason Lowe commented on MAPREDUCE-4729:
---------------------------------------

I tried testing the patch with a sleep job using -Dyarn.app.mapreduce.am.job.recovery.enable=false
and manually killing the ApplicationMaster with a kill -9, but it didn't work.  The log showed
this exception:

{noformat}
2012-11-01 14:37:01,543 WARN [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Could
not parse the old history file. Will not have old AMinfos 
java.io.IOException: Incompatible event log version: null
	at org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:70)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.readJustAMInfos(MRAppMaster.java:915)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.start(MRAppMaster.java:846)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1143)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1378)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1139)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1098)
{noformat}

It looks like the AM is buffering the history file output, so the AMInfos from previous runs
had not been flushed to the file when the AM was killed.  When I used a normal kill instead
of kill -9, it worked.  We will want to flush/sync the job history file after writing the
AMInfos to help guard against unclean teardowns losing prior AM attempts in the history.
This can be fixed in a separate JIRA if we don't want to fix it here.
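
To illustrate the buffering concern, here is a minimal sketch in plain java.io; it is not the MRAppMaster or JobHistoryEventHandler code. The class HistoryFlushSketch, its methods, and the record text are hypothetical names made up for the example, and an in-memory sink stands in for the history file. On HDFS the analogous calls would presumably be hflush()/hsync() on the FSDataOutputStream, which is an assumption about where the fix would go rather than what the patch does.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: shows why a buffered record is lost on an
// abrupt kill unless it is flushed right after being written.
final class HistoryFlushSketch {

    // Stand-in for writing one AMInfo record to the history stream.
    static void writeAmInfo(OutputStream out, int attempt, boolean flushAfterWrite)
            throws IOException {
        String record = "AM_STARTED attempt=" + attempt + "\n";
        out.write(record.getBytes(StandardCharsets.UTF_8));
        if (flushAfterWrite) {
            out.flush(); // push buffered bytes down to the underlying sink
        }
    }

    // Returns how many bytes reached the sink if the process died right after
    // the write, i.e. the stream was never closed (as with kill -9).
    static int bytesSurvivingAbruptKill(boolean flushAfterWrite) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream buffered = new BufferedOutputStream(sink, 8192);
        try {
            writeAmInfo(buffered, 1, flushAfterWrite);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Deliberately no close(): an unclean teardown never runs it.
        return sink.size();
    }

    public static void main(String[] args) {
        System.out.println("bytes in sink without flush: " + bytesSurvivingAbruptKill(false));
        System.out.println("bytes in sink with flush:    " + bytesSurvivingAbruptKill(true));
    }
}
```

Without the flush, the record is still sitting in the buffer when the process dies, which would match the empty or versionless history file behind the "Incompatible event log version: null" exception above.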

Couple of other comments on the patch:
* Application attempts start from 1 instead of 0, so the first attempt tries to recover AMInfos
when it shouldn't, which leads to a large FileNotFoundException stacktrace being logged.
* Nit: In RecoveryService.parse there's an extra space logged before a comma.
{{LOG.info("Got an error parsing job-history file "}} should be
{{LOG.info("Got an error parsing job-history file"}}
* Nit: The body of the while loop in readJustAMInfos could be a bit cleaner with fewer conditionals.
 For example:
{code}
      while ((event = jobHistoryEventReader.getNextEvent()) != null) {
        if (event.getEventType() == EventType.AM_STARTED) {
          amStartedEventsBegan = true;
          AMStartedEvent amStartedEvent = (AMStartedEvent) event;
          amInfos.add(MRBuilderUtils.newAMInfo(
            amStartedEvent.getAppAttemptId(), amStartedEvent.getStartTime(),
            amStartedEvent.getContainerId(),
            StringInterner.weakIntern(amStartedEvent.getNodeManagerHost()),
            amStartedEvent.getNodeManagerPort(),
            amStartedEvent.getNodeManagerHttpPort()));
        } else if (amStartedEventsBegan) {
          // This means AMStartedEvents began and this event is a
          // non-AMStarted event.
          // No need to continue reading all the other events.
          break;
        }
      }
{code}
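
For the first bullet above (attempt IDs starting at 1), the guard could look roughly like the sketch below. AttemptGuardSketch, FIRST_ATTEMPT_ID and shouldReadPreviousAmInfos are hypothetical names for illustration, not identifiers from the patch; the only assumption taken from this comment is that the first application attempt has ID 1.

```java
// Hypothetical illustration only: decide whether an attempt should try to
// read AMInfos written by earlier attempts.
final class AttemptGuardSketch {

    // Assumption from the review comment: attempt IDs start at 1, not 0.
    static final int FIRST_ATTEMPT_ID = 1;

    // Only attempts after the first have a previous history file to read.
    static boolean shouldReadPreviousAmInfos(int attemptId) {
        return attemptId > FIRST_ATTEMPT_ID;
    }

    public static void main(String[] args) {
        for (int attemptId = 1; attemptId <= 3; attemptId++) {
            System.out.println("attempt " + attemptId + " reads old AMInfos: "
                + shouldReadPreviousAmInfos(attemptId));
        }
    }
}
```

Skipping the read on the first attempt avoids opening a history file that cannot exist yet, and so avoids the FileNotFoundException stacktrace.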
                
> job history UI not showing all job attempts
> -------------------------------------------
>
>                 Key: MAPREDUCE-4729
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4729
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 0.23.3
>            Reporter: Thomas Graves
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: MAPREDUCE-4729-20121031.txt
>
>
> We are seeing a case where a job runs but the AM is running out of memory in the first
> 3 attempts. The job eventually finishes on the 4th attempt.  When you go to the job history
> UI for that job, it only shows the last attempt.  This is bad since we want to see why the
> first 3 attempts failed.
> The RM web ui shows all 4 attempts.
> Also I tested this locally by running "kill" on the app master and in that case the history
> server UI does show all attempts.

