hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Siddharth Seth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5079) Recovery should restore task state from job history info directly
Date Wed, 10 Apr 2013 00:30:16 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627311#comment-13627311

Siddharth Seth commented on MAPREDUCE-5079:

Overall, looks pretty good. Comments and mostly minor nits
- TaskAttemptImpl.needToClean - actual cleanup seems to be buried inside the "KILLED" code
block. Does this need to be moved out ?
- JobImpl.startTansition - there's an unnecessary null check on 'jse'
- processRecovery can be changed to check the appAttemptId up front, and change the logged
message. The message saying "Will not try to recover" on attempt 1 can be a little confusing.
- JobHistoryUtils is likely a better place for getPreviousJobHistoryFileStream
- TaskImpl.sendTaskSucceededEvent - has a log message about killing remaining containers.
This isn't needed here.
- Conversion from attaemptState to TaskAttemptCompletionEventStatus - this is a direct string
conversion based on the fact that only a select set of values are used (FAILED/KILLED/SUCCEEDED).
Would prefer this conversion being explicit, with an error on an incorrect state. This won't
happen now, but may simplify things for future changes.
- MRJobConfig.MR_AM_JOB_RECOVERY_ENABLE - can we add a default value in MRJobConfig instead
of leaving it in the MRAppMaster.

For the unit tests,
- some formatting fixes are required. > 80 column width, spaces etc.
- testRecoveryTaskSuccessAllAttemptsFail could also verify that a new attempt is created
- testRecoveryTaskSuccessAllAttemptsSucceed could verify the correct event being sent to the
job as a CompletionEvent
- ApplicationId, JobId construction via BuilderUtils, MRBuilderUtils instead of directly
-  when(mockTAinfo.getState()).thenReturn("RUNNING"); //?? - RUNNING can be replaced with
a random string to be less confusing. This is a user process reported string afaik.

> Recovery should restore task state from job history info directly
> -----------------------------------------------------------------
>                 Key: MAPREDUCE-5079
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5079
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 0.23.7
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-5079-branch-0.23.patch, MAPREDUCE-5079.patch, MAPREDUCE-5079.patch,
> We've encountered a lot of hanging issues during MR-AM recovery because the state machines
don't always end up in the same states after recovery.  This is especially true when speculative
execution is enabled.  It should be straightforward to restore task and task attempt states
directly from the TaskInfo and TaskAttemptInfo records in the job history file to avoid relying
on the task state machines ending up in the proper states with the proper number of attempts.
> This should be a more robust solution that would also give us the option of recovering
start time and log locations for tasks that were in-progress when the AM crashed.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message