hadoop-mapreduce-issues mailing list archives

From "Vinod K V (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing
Date Mon, 16 Nov 2009 10:26:41 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778288#action_12778288 ]

Vinod K V commented on MAPREDUCE-1119:
--------------------------------------

The patch looks very clean now! Thanks! It is very close; I have only a few comments on the
latest patch, most of them minor:
 - Care to explain the changes to {{src/c++/task-controller/main.c}} w.r.t. conf_dir_len? Both
for my confirmation and for the record.
 - Change the C comments for {{kill_user_task()}} in {{src/c++/task-controller/task-controller.c}}
to mention that it can terminate/kill or dump-stack?
 - Now that the semantics have changed, I am not sure we want to use the same configuration
property for sleeping after dump-stack. (Thinking aloud..) Do we even need a sleep here? The
signalling order is SIGQUIT->SIGTERM->SIGKILL. Will signals be processed in the order
of their arrival? If so, then we will not need another sleep. If not, we may need a sleep here,
but it may or may not be driven by the same config item. What do you think?
 - All three newly added methods in {{JvmManager}} can be package-private or private.
 - ProcessTree.java:
   -- A lot of refactoring. Nice!
   -- The variables SIG* and SIG*_STR can all be private, as can {{maybeSignalProcess()}}
and {{maybeSignalProcessGroup()}}.
 - TestJobKillAndFail
   -- Are we sure "PSPermGen" will always be there in the dump? Instead how about passing
our own {{TaskController}} that does custom actions in {{TaskController.dumpStacks()}}, simplifying
our verification that dump-stack is indeed called?
   -- The test now takes a very long time. The test time can be more than halved if we set max-map-attempts
to one in both the tests via {{conf.setMaxMapAttempts(1);}}
 - We need a similar test for {{LinuxTaskController}} to test stack-dump when multiple users
are involved. You can look at {{TestLocalizationWithLinuxTaskController}} and/or {{TestJobExecutionAsDifferentUser}}
for inspiration.
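
A minimal sketch of the test-double idea suggested above for TestJobKillAndFail: instead of searching the dump for "PSPermGen", the test plugs in a controller that only records that {{dumpStacks()}} was invoked. {{StubTaskController}} is a hypothetical stand-in so the example compiles on its own; the real {{TaskController}} and the exact {{dumpStacks()}} signature in the patch may differ, and {{RecordingTaskController}} and {{getDumpCalls()}} are illustrative names only.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical stand-in for the real TaskController (assumption: only the
 * method relevant to this test is modelled, and it takes no arguments).
 */
abstract class StubTaskController {
    /** Asks the task's JVM to dump its stacks. */
    abstract void dumpStacks();
}

/** Test double: counts dump-stack requests instead of signalling a process. */
final class RecordingTaskController extends StubTaskController {
    private final AtomicInteger dumpCalls = new AtomicInteger();

    @Override
    void dumpStacks() {
        // Record the call; no signal is sent to any process here.
        dumpCalls.incrementAndGet();
    }

    /** Number of times dumpStacks() has been called so far. */
    int getDumpCalls() {
        return dumpCalls.get();
    }
}
```

With such a controller configured, the test could let one unresponsive map task time out (with {{conf.setMaxMapAttempts(1)}} to keep the run short) and then assert that {{getDumpCalls()}} is greater than zero.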

> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, MAPREDUCE-1119.4.patch,
>                      MAPREDUCE-1119.5.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow gather a stack
> dump for the task. This could be done either by sending a SIGQUIT (so the dump ends up in
> stdout) or perhaps something like JDI to gather the stack directly from Java. This may be
> somewhat tricky since the child may be running as another user (so the SIGQUIT would have
> to go through LinuxTaskController). This feature would make debugging these kinds of failures
> much easier, especially if we could somehow get it into the TaskDiagnostic message.
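
As a rough illustration of the signalling discussed in this issue, the sketch below plans a SIGQUIT (to request a stack dump) followed by SIGTERM and SIGKILL, with a pause after each non-final signal. It assumes a pause is wanted after SIGQUIT, since POSIX leaves the delivery order of several pending standard signals unspecified. {{SignalPlan}}, its methods, and the two delay parameters are hypothetical names for illustration and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not from the patch): plans signals for a hung task. */
final class SignalPlan {

    /** One step: the signal to send and how long to wait afterwards. */
    static final class Step {
        final String signal;
        final long sleepAfterMs;

        Step(String signal, long sleepAfterMs) {
            this.signal = signal;
            this.sleepAfterMs = sleepAfterMs;
        }
    }

    /**
     * Builds the ordered steps. When dumpStack is true, QUIT comes first so
     * the JVM has dumpWaitMs to print its stacks before TERM arrives.
     */
    static List<Step> plan(boolean dumpStack, long dumpWaitMs, long killWaitMs) {
        List<Step> steps = new ArrayList<Step>();
        if (dumpStack) {
            steps.add(new Step("QUIT", dumpWaitMs));
        }
        steps.add(new Step("TERM", killWaitMs));
        steps.add(new Step("KILL", 0L)); // last resort, nothing to wait for
        return steps;
    }

    /** Argument vector for the POSIX kill utility, e.g. kill -s QUIT <pid>. */
    static String[] toKillCommand(Step step, String pid) {
        return new String[] {"kill", "-s", step.signal, pid};
    }
}
```

Keeping the dump delay as its own parameter mirrors the open question of whether the post-dump sleep should share the existing kill-delay configuration property.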

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
