hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-7053) Timed out tasks can fail to produce thread dump
Date Wed, 14 Feb 2018 22:12:00 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-7053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-7053:
    Status: Patch Available  (was: Open)

Yeah, this is yet another latent bug that was exposed when the task attempt listener started
rejecting status updates for tasks the AM no longer thinks are running.

As such I'm proposing a fix where we do *not* immediately reject attempts that the AM thinks
should not be running, but rather give them a grace period of sorts.  This patch adds the
ability of the task heartbeat handler to track attempts that have unregistered recently. 
It uses the same grace period for unregistered tasks that is currently used for tasks that
have unregistered via the umbilical and are shutting down gracefully.  This keeps the AM from
immediately rejecting a recently unregistered attempt, allowing that attempt to receive a
stack dump signal and otherwise shut down cleanly by itself.  After the grace period expires,
the AM will reject status updates from that attempt.
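
To make the grace period idea concrete, here is a minimal sketch of tracking recently
unregistered attempts.  It is not the code in the attached patch: the class name, method
names, and the use of plain strings for attempt IDs are all assumptions made for
illustration, and timestamps are passed in explicitly so the behavior is easy to check.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch only; not the TaskHeartbeatHandler changes from the patch.
 * Remembers when each attempt was unregistered so that status updates arriving
 * shortly afterwards can be tolerated instead of rejected outright.
 */
class RecentlyUnregisteredTracker {
    // Assumed to be the same grace period used for attempts shutting down gracefully.
    private final long gracePeriodMs;

    // Attempt ID -> time in milliseconds at which the attempt was unregistered.
    private final Map<String, Long> unregisteredAt = new ConcurrentHashMap<>();

    RecentlyUnregisteredTracker(long gracePeriodMs) {
        this.gracePeriodMs = gracePeriodMs;
    }

    /** Record that the AM stopped tracking this attempt at time nowMs. */
    void unregister(String attemptId, long nowMs) {
        unregisteredAt.put(attemptId, nowMs);
    }

    /**
     * Returns true while the attempt is still inside its grace period.
     * Once the period has expired the entry is dropped, and any later
     * status update from that attempt should be rejected.
     */
    boolean isWithinGracePeriod(String attemptId, long nowMs) {
        Long when = unregisteredAt.get(attemptId);
        if (when == null) {
            return false; // never tracked, or grace period already expired
        }
        if (nowMs - when > gracePeriodMs) {
            unregisteredAt.remove(attemptId);
            return false;
        }
        return true;
    }
}
```

With something like this in place, the listener would reject a status update only when the
attempt is neither registered nor within its grace period, which leaves the attempt alive
long enough for the stack dump signal to reach it.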

> Timed out tasks can fail to produce thread dump
> -----------------------------------------------
>                 Key: MAPREDUCE-7053
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7053
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4, 2.7.6
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Major
>         Attachments: MAPREDUCE-7053.001.patch
>
> TestMRJobs#testThreadDumpOnTaskTimeout has been failing sporadically recently.  When
> the AM times out a task, it immediately removes it from the list of known tasks and then connects
> to the NM to request a thread dump followed by a kill.  If the task heartbeats in after the
> task has been removed from the list of known tasks but before the thread dump signal arrives,
> then the task can exit with an "org.apache.hadoop.mapred.Task: Parent died." message and no
> thread dump.
