hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing
Date Wed, 11 Nov 2009 20:54:39 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776648#action_12776648 ]

Aaron Kimball commented on MAPREDUCE-1119:
------------------------------------------

I named this parameter {{wasFailure}} because TaskTracker already uses that name
(see {{setTaskFailState()}}, {{jobHasFinished()}}, {{kill()}}, {{cleanUpOverMemoryTask()}})
to indicate whether a kill was failure-based or for other purposes (cleanup, preemption, etc.).

A more systemic overhaul of failure-reason tracking is probably warranted, but I think that
should happen in a separate issue?

As for your table... if you look at {{TaskController.destroyTaskJVM()}} (line 151), you can
see that a stack dump is generated iff {{wasFailure}} is true.

I tested this by running a sleep job that slept for 60 seconds in each call to {{map()}}.
Results follow:

|*Test case*|*Stack dump?*|
|set {{mapreduce.task.timeout}} to 10000 (task timeout)|yes|
|ran {{bin/mapred job -kill-task}} on attempts|no|
|ran {{bin/mapred job -fail-task}} on attempts|no|
|let the job complete successfully|no|
|ran {{bin/mapred job -kill}} on the job itself|no|
|threw a RuntimeException in the mapper|no|

Thus, I believe that translates into the following for your table:

|*Reason*|*wasFailure*|*generates stack*|
|Child exception|maybe|maybe*|
|Other task failures|false|false|
|Task timeout|true|true|
|Task killed by user|false|false|
|Task failed by user|false|false|
|Job killed by user|false|false|
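
The table above, combined with the rule that a stack dump is generated iff {{wasFailure}} is true, could be written down roughly as follows. This is only an illustrative sketch: {{KillReasonSketch}} and {{KillReason}} are hypothetical names that do not exist in Hadoop, and the "maybe" row is modeled as an empty {{Optional}}.

```java
import java.util.Optional;

public class KillReasonSketch {

    /** Hypothetical enum mirroring the rows of the table; not a Hadoop type. */
    public enum KillReason {
        CHILD_EXCEPTION(null),        // "maybe": depends on which catch block fires
        OTHER_TASK_FAILURE(false),
        TASK_TIMEOUT(true),
        TASK_KILLED_BY_USER(false),
        TASK_FAILED_BY_USER(false),
        JOB_KILLED_BY_USER(false);

        // null encodes the "maybe" entry of the table
        private final Boolean wasFailure;

        KillReason(Boolean wasFailure) {
            this.wasFailure = wasFailure;
        }

        /** Empty when the table says "maybe". */
        public Optional<Boolean> wasFailure() {
            return Optional.ofNullable(wasFailure);
        }

        /** A stack dump is generated iff wasFailure is true, so this is the same value. */
        public Optional<Boolean> generatesStack() {
            return wasFailure();
        }
    }

    public static void main(String[] args) {
        for (KillReason reason : KillReason.values()) {
            String dump = reason.generatesStack()
                    .map(b -> Boolean.toString(b))
                    .orElse("maybe");
            System.out.println(reason + " -> stack dump: " + dump);
        }
    }
}
```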

Looking at {{org.apache.hadoop.mapred.Child}}, there are a few different catch blocks in there:
* If a task throws a {{FSError}}, this triggers {{TaskUmbilicalProtocol.fsError()}}, which
will cause a {{purgeTask(tip, wasFailure=true)}}.
* If a task throws any other sort of {{Exception}}, this does not trigger any particular call
on the {{TaskUmbilicalProtocol}} (TUP); the exception string is passed to
{{TaskTracker.reportDiagnosticInfo()}}, but this simply logs a string of text and takes no further action.
* If a map task throws any other {{Throwable}}, this triggers {{TUP.fatalError()}}, which
also calls {{purgeTask(tip, wasFailure=true)}}.
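
The ordering of those three catch blocks can be sketched in a self-contained way. This is not the code in {{org.apache.hadoop.mapred.Child}}: {{ChildCatchSketch}}, {{FsErrorStandIn}} (standing in for Hadoop's {{FSError}}, assumed here to be an {{Error}} subclass) and {{Reaction}} are all made up for illustration, and the umbilical calls are reduced to enum values.

```java
public class ChildCatchSketch {

    /** Stand-in for Hadoop's FSError so the sketch compiles without Hadoop. */
    public static class FsErrorStandIn extends Error {
        public FsErrorStandIn(String message) {
            super(message);
        }
    }

    /** Hypothetical labels for the TaskTracker-side reaction described above. */
    public enum Reaction {
        NONE,               // task finished normally
        FS_ERROR_PURGE,     // fsError() -> purgeTask(tip, wasFailure=true)
        DIAGNOSTIC_ONLY,    // reportDiagnosticInfo() only, no purge
        FATAL_ERROR_PURGE   // fatalError() -> purgeTask(tip, wasFailure=true)
    }

    /** Runs the task and reports which catch block would have handled it. */
    public static Reaction runTask(Runnable task) {
        try {
            task.run();
            return Reaction.NONE;
        } catch (FsErrorStandIn e) {
            // Most specific case first: file system errors.
            return Reaction.FS_ERROR_PURGE;
        } catch (Exception e) {
            // Ordinary exceptions only produce a diagnostic string.
            return Reaction.DIAGNOSTIC_ONLY;
        } catch (Throwable t) {
            // Everything else, e.g. other Error subclasses.
            return Reaction.FATAL_ERROR_PURGE;
        }
    }

    /** True when the reaction ends in a purge with wasFailure=true. */
    public static boolean purgesAsFailure(Reaction reaction) {
        return reaction == Reaction.FS_ERROR_PURGE
                || reaction == Reaction.FATAL_ERROR_PURGE;
    }

    public static void main(String[] args) {
        Reaction r = runTask(() -> { throw new RuntimeException("mapper bug"); });
        System.out.println(r + ", purges as failure: " + purgesAsFailure(r));
    }
}
```

In this model a {{RuntimeException}} thrown by the mapper lands in the second block and never requests a failure purge, which is consistent with the "no" recorded for that test case.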

But immediately after these catch blocks, it closes the RPC proxy, shuts down the logging
thread, and exits the JVM. So fsError and fatalError *may* cause a stack dump if the TT processes
the request fast enough and issues a SIGQUIT in the next few microseconds. But this is racing
against the fact that the child task's next action is "exit immediately."

Note that job kill, task timeout, and the task exception cases are all covered in the unit
test provided in this patch.


> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow gather a stack
> dump for the task. This could be done either by sending a SIGQUIT (so the dump ends up in
> stdout) or perhaps something like JDI to gather the stack directly from Java. This may be
> somewhat tricky since the child may be running as another user (so the SIGQUIT would have
> to go through LinuxTaskController). This feature would make debugging these kinds of failures
> much easier, especially if we could somehow get it into the TaskDiagnostic message.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

