hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing
Date Fri, 23 Oct 2009 20:29:59 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Aaron Kimball updated MAPREDUCE-1119:
-------------------------------------

    Attachment: MAPREDUCE-1119.patch

Attaching a patch which performs this function.

Stack traces are written to the stdout of the task itself via {{SIGQUIT}}; this naturally lets
them be collected in the {{stdout}} log of the task.

This patch modifies the API of {{TaskController}} to include a {{dumpTaskStack()}} method
that invokes {{SIGQUIT}}.

In {{DefaultTaskController}}, this is actually performed by {{ProcessTree}}.  The {{LinuxTaskController}}
will send a new opcode {{TaskCommands.QUIT_TASK_JVM}} to the {{task-controller}} module; the
module then sends the {{SIGQUIT}} signal to the task JVM itself.

The existing behavior of {{TaskController.destroyTaskJVM()}} is to send {{SIGTERM}}, sleep
for {{context.sleeptimeBeforeSigkill}}, and then send {{SIGKILL}}; I've modified this method
so that it goes {{SIGQUIT}}/sleep/{{SIGTERM}}/sleep/{{SIGKILL}}. The sleep after the {{SIGQUIT}}
is necessary to give the task time to actually write the stack dump before it has to handle {{SIGTERM}}.

I tested this by running some jobs which time out and verified that they got the stack dumps
in their task stdout logs; jobs which succeed do not. I did this with both the DefaultTaskController
and the LinuxTaskController. I also added a unit test to the patch which checks that evidence
of a stack dump appears in the stdout log for a task which is killed by the unit test.

While I was in the {{task-controller}} C++ module, I discovered a segfault which is also fixed
in this patch. If {{HADOOP_CONF_DIR}} isn't defined, it expects {{argv[0]}} to be the full
path to {{task-controller}} so it can find the {{conf}} dir based on this. If you just run
{{./task-controller}}, it will try to malloc a negative amount of space. I changed it to
exit gracefully with an error message in this case. (Simple fix; no unit test case.)



> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>         Attachments: MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow gather a stack
> dump for the task. This could be done either by sending a SIGQUIT (so the dump ends up in
> stdout) or perhaps something like JDI to gather the stack directly from Java. This may be
> somewhat tricky since the child may be running as another user (so the SIGQUIT would have
> to go through LinuxTaskController). This feature would make debugging these kinds of failures
> much easier, especially if we could somehow get it into the TaskDiagnostic message.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

