hadoop-mapreduce-issues mailing list archives

From "Gera Shegalov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them
Date Mon, 24 Feb 2014 20:03:23 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov updated MAPREDUCE-5044:

    Attachment: MAPREDUCE-5044.v04.patch

v04 applies on top of YARN-1515.v05. It now makes sure that a thread dump is also created
in uber mode.
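
For illustration only: in uber mode the task runs inside the AM's own JVM, so a thread dump
can in principle be taken in-process rather than by signalling a separate container. The
sketch below is not taken from the patch. It uses only the standard java.lang.management
API, and the class and method names (InProcessThreadDump, dumpAllThreads) are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Hadoop: builds a jstack-like dump of the current JVM.
public class InProcessThreadDump {

    /** Returns the name, state and full stack trace of every live thread in this JVM. */
    public static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Request lock details only when the JVM supports them; otherwise
        // dumpAllThreads would throw UnsupportedOperationException.
        boolean monitors = mx.isObjectMonitorUsageSupported();
        boolean synchronizers = mx.isSynchronizerUsageSupported();

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(monitors, synchronizers)) {
            if (info == null) {
                continue; // defensive: skip threads that ended while the dump was taken
            }
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Write to stderr, where jstack-style output triggered by SIGQUIT usually ends up.
        System.err.print(dumpAllThreads());
    }
}
```

Unlike SIGQUIT, this only covers the calling JVM, which is why it would fit the uber case and
not a regular task attempt running in its own container.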

Added unit tests for a normal MR job and uber MR job.

While working on this I realized that we need to discuss how mapreduce.task.timeout is
treated in uber mode. Right now it is effectively ignored: the AM does not kill itself, and
LocalContainerLauncher processes CONTAINER_REMOTE_CLEANUP inline in SubtaskRunner, the same
thread in which the task is stuck. The liveness monitor for the AM in the RM does not catch
the problem either, because RMCommunicator heartbeats from a separate allocator thread.

I am considering two options:
- move heartbeat() into SubtaskRunner for uber mode so that the liveness monitor catches
the stuck uber task.
- call System.exit(errorcode) when TA_TIMEOUT occurs.
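
To make the first option concrete, here is a minimal simulation of why a heartbeat sent from
a separate thread hides a stuck task, while a heartbeat tied to the task thread lets the
monitor expire it. Everything in it is an assumption for illustration: LivenessSketch,
LivenessMonitor and the 1 s / 3 s intervals are hypothetical and are not the Hadoop classes
or defaults. Time is simulated so that the run is deterministic.

```java
// Hypothetical sketch, not Hadoop code: models heartbeat placement with a simulated clock.
public class LivenessSketch {

    /** Stand-in for a liveness monitor: expires when no heartbeat arrives within timeoutMs. */
    static final class LivenessMonitor {
        private final long timeoutMs;
        private long lastHeartbeatMs;

        LivenessMonitor(long timeoutMs, long startMs) {
            this.timeoutMs = timeoutMs;
            this.lastHeartbeatMs = startMs;
        }

        void heartbeat(long nowMs) {
            lastHeartbeatMs = nowMs;
        }

        boolean isExpired(long nowMs) {
            return nowMs - lastHeartbeatMs > timeoutMs;
        }
    }

    /**
     * Simulates 10 seconds during which the task never makes progress.
     * Returns true if the monitor expires the process in that window.
     */
    static boolean expiresWhileTaskStuck(boolean heartbeatFromTaskThread) {
        LivenessMonitor monitor = new LivenessMonitor(3000, 0);
        for (long now = 1000; now <= 10000; now += 1000) {
            // A separate heartbeat thread keeps beating regardless of the task.
            // A heartbeat issued from the task thread cannot fire, because that thread is stuck.
            if (!heartbeatFromTaskThread) {
                monitor.heartbeat(now);
            }
            if (monitor.isExpired(now)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("separate heartbeat thread, expired: " + expiresWhileTaskStuck(false));
        System.out.println("heartbeat from task thread, expired: " + expiresWhileTaskStuck(true));
    }
}
```

The second option sidesteps the monitor entirely: the process would exit on TA_TIMEOUT
instead of waiting to be expired from outside.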


> Have AM trigger jstack on task attempts that timeout before killing them
> ------------------------------------------------------------------------
>                 Key: MAPREDUCE-5044
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Assignee: Gera Shegalov
>         Attachments: MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, Screen Shot 2013-11-12 at 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
> When an AM expires a task attempt it would be nice if it triggered a jstack output via
> SIGQUIT before killing the task attempt.  This would be invaluable for helping users debug
> their hung tasks, especially if they do not have shell access to the nodes.

This message was sent by Atlassian JIRA
