hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing
Date Tue, 10 Nov 2009 18:47:27 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776003#action_12776003
] 

Aaron Kimball commented on MAPREDUCE-1119:
------------------------------------------

Actually, I suppose that if the kill comes from the JT, then it's definitely a speculative
task attempt, right? Task attempt timeouts are handled between the attempt and the TT; the
JT isn't involved at all.

In the event of a timeout, markUnresponsiveTasks() calls TaskTracker.purgeTask(tip, wasFailure=true),
which calls tip.jobHasFinished(wasFailure), which calls tip.kill(wasFailure).

Unfortunately, this is where the chain of failure/non-failure information about why the task
should be killed ends. tip.kill() calls TaskRunner.kill(), which calls JvmManager.taskKilled(this),
which calls JvmManagerForType.taskKilled(taskRunner), which calls JvmMgrForType.killJvm(jvmId),
which calls JvmRunner.kill(), which calls TaskController.destroyTaskJvm(TaskControllerContext).
(Someone please correct me if I'm wrong.)

TaskRunner.kill(), however, doesn't take a reason code like wasFailure. This could be changed,
but then we'd also need to modify JvmManager and add a synchronized or volatile hand-off of
this data into the TaskControllerContext object. Is all this worth it just to avoid stack
dumps in aborted speculative task attempts?
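
For illustration only, here is a minimal sketch of what threading the reason through the kill
path might look like. Every class and method name below (KillReason, KillContext,
SketchTaskRunner, SketchJvmManager, SketchTaskController) is a hypothetical stand-in, not the
real Hadoop code, and the real chain has more hops than shown. It assumes the reason is stored
in a volatile field on the context object and that the controller only requests a stack dump
when the kill was a failure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT the real Hadoop classes.
enum KillReason { TIMEOUT_FAILURE, SPECULATIVE_ABORT }

// Plays the role of a context object handed to the controller.
class KillContext {
    // volatile: written by the thread requesting the kill, possibly read by
    // a different thread that actually destroys the JVM.
    private volatile KillReason reason = KillReason.SPECULATIVE_ABORT;

    void setReason(KillReason reason) { this.reason = reason; }
    KillReason getReason() { return reason; }
}

// Plays the role of the controller at the end of the chain.
class SketchTaskController {
    // Records what would be done, instead of sending real signals.
    final List<String> actions = new ArrayList<>();

    void destroyTaskJvm(KillContext context) {
        // Only dump the stack when the task is being killed as a failure.
        if (context.getReason() == KillReason.TIMEOUT_FAILURE) {
            actions.add("dump-stack");
        }
        actions.add("kill");
    }
}

// Plays the role of the JVM manager layer that currently drops the reason.
class SketchJvmManager {
    private final SketchTaskController controller;
    private final KillContext context = new KillContext();

    SketchJvmManager(SketchTaskController controller) {
        this.controller = controller;
    }

    void taskKilled(KillReason reason) {
        context.setReason(reason);          // hand the reason off to the context
        controller.destroyTaskJvm(context);
    }
}

// Plays the role of the task runner, now accepting a reason flag.
class SketchTaskRunner {
    private final SketchJvmManager jvmManager;

    SketchTaskRunner(SketchJvmManager jvmManager) {
        this.jvmManager = jvmManager;
    }

    void kill(boolean wasFailure) {
        KillReason reason = wasFailure
                ? KillReason.TIMEOUT_FAILURE
                : KillReason.SPECULATIVE_ABORT;
        jvmManager.taskKilled(reason);
    }
}

class KillReasonSketch {
    public static void main(String[] args) {
        SketchTaskController timedOut = new SketchTaskController();
        new SketchTaskRunner(new SketchJvmManager(timedOut)).kill(true);
        System.out.println("timeout kill:     " + timedOut.actions);

        SketchTaskController speculative = new SketchTaskController();
        new SketchTaskRunner(new SketchJvmManager(speculative)).kill(false);
        System.out.println("speculative kill: " + speculative.actions);
    }
}
```

In this sketch a timeout kill records a stack dump before the kill, while an aborted
speculative attempt records only the kill. The cost is the one raised above: every layer
between the runner and the controller has to carry the extra argument.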


> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow gather a stack
> dump for the task. This could be done either by sending a SIGQUIT (so the dump ends up in
> stdout) or perhaps something like JDI to gather the stack directly from Java. This may be
> somewhat tricky since the child may be running as another user (so the SIGQUIT would have
> to go through LinuxTaskController). This feature would make debugging these kinds of failures
> much easier, especially if we could somehow get it into the TaskDiagnostic message.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

