hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "NextGenMapReduceDevTesting" by Arun C Murthy
Date Tue, 10 May 2011 07:11:32 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "NextGenMapReduceDevTesting" page has been changed by Arun C Murthy.
The comment on this change is: v1.
http://wiki.apache.org/hadoop/NextGenMapReduceDevTesting

--------------------------------------------------

New page:
This wiki tracks developer-testing for NextGenMapReduce.

This aim of this document is to capture various failure handling scenarios for !MapReduce
applications running under YARN and the YARN framework itself.

=== Failure scenarios ===

==== User task error ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)'''
||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || ||
||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||

==== User task error, same task fails 4 times ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)'''
||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || ||
||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||
|| AM fails the !MapReduce job and exits || || ||

==== Container failure ====

===== Localization error =====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)'''
||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || ||
||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||

===== Exceeding memory or disk limits =====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)'''
||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || ||
||
|| !CapacityScheduler releases resources for queue, user and application || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||

===== Lost map output or faulty NM Netty =====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)'''
||
|| Reduces report shuffle failure errors to AM || || ||
|| On sufficient fetch-failure notifications the AM re-runs map || || ||

===== User fails/kills map or reduce task =====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)'''
||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || ||
||
|| !CapacityScheduler releases resources for queue, user and application || || ||
||RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||

==== Node failure due to timeout or health-check error ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)'''
||
|| RM fails all running containers and informs appropriate AMs || || ||
|| Shuffle failures for completed map containers... handled (aggressively?) by AM || || ||
|| AM re-runs running task-attempts and completed maps || || ||

==== !MapReduce AM failure ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)'''
||
|| NM notifies RM  || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| ASM recognises AM failure || || ||
|| ASM kills all running containers || || ||
|| ASM restarts !MapReduce AM || || ||
|| !MapReduce AM recovers and re-runs only non-complete tasks || || ||

==== !ResourceManager bounce ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)'''
||
|| RM recovers all running AMs || || ||
|| RM recovers all running containers || || ||
|| RM rebuilds !CapacityScheduler queue & user capacities || || ||
|| !MapReduce AMs re-runs only non-complete tasks || || ||

Mime
View raw message