hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker
Date Wed, 10 Oct 2007 20:14:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533863 ]

Arun C Murthy commented on HADOOP-2016:

Here are relevant logs:

1. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: task_200710090910_0003_r_001792_1
2. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task:
3. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskTracker: task_200710090910_0003_r_001792_1 0.67524564% reduce > reduce
4. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskRunner: task_200710090910_0003_r_001792_1 done; removing files.
5. 2007-10-09 12:19:15,491 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child task finshed: task_200710090910_0003_r_001792_1. Ignored.
6. 2007-10-09 12:19:18,059 WARN org.apache.hadoop.mapred.TaskTracker: Progress from unknown child task: task_200710090910_0003_r_001792_1

Note line #3 above in particular: it appears that the task's progress update (from the child VM)
was interleaved with the methods called while purging the task, i.e.
{{TaskTracker#purgeTask}} -> {{TaskTracker#TaskInProgress#jobHasFinished}}, which then calls
{{TaskTracker#TaskInProgress#kill}} and {{TaskTracker#TaskInProgress#cleanup}}.

Unfortunately, a couple of issues combine to produce this scenario:
a) {{TaskTracker#TaskInProgress#jobHasFinished}} isn't a synchronized method, so the calls it
makes, i.e. {{TaskTracker#TaskInProgress#kill}} and {{TaskTracker#TaskInProgress#cleanup}},
are not executed atomically.
b) Thus the calls to kill and cleanup can be interleaved with a call to {{TaskTracker#TaskInProgress#reportProgress}}
(as seen in the logs). This is dangerous since it is the *{{TaskTracker#TaskInProgress#cleanup}}*
call which removes the taskid from {{TaskTracker#tasks}}.
c) {{TaskTracker#TaskInProgress#reportProgress}} unconditionally marks the task's run-state
as {{RUNNING}}, which is clearly wrong since it overwrites the task's {{KILLED}} status set
in {{TaskTracker#TaskInProgress#kill}}.

Taken together, these mean the task is never removed from {{TaskTracker#runningTasks}},
which is the bug in question.
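
To make the interleaving concrete, here is a minimal single-threaded replay of the sequence seen in the logs. It is only a sketch: {{Tracker}}, {{TaskInProgress}} and {{heartbeatSweep}} below are hypothetical stand-ins, not the real Hadoop classes, and it assumes that an entry leaves {{runningTasks}} only once its run-state is no longer {{RUNNING}}.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the tracker state; these are NOT the real Hadoop classes. */
public class KilledTaskRaceSketch {
    public enum State { RUNNING, KILLED }

    public static class TaskInProgress {
        final String id;
        State runState = State.RUNNING;
        TaskInProgress(String id) { this.id = id; }
    }

    public static class Tracker {
        // Stand-ins for TaskTracker#tasks and TaskTracker#runningTasks.
        final Map<String, TaskInProgress> tasks = new HashMap<>();
        final Map<String, TaskInProgress> runningTasks = new HashMap<>();

        TaskInProgress launch(String id) {
            TaskInProgress tip = new TaskInProgress(id);
            tasks.put(id, tip);
            runningTasks.put(id, tip);
            return tip;
        }

        void kill(TaskInProgress tip) { tip.runState = State.KILLED; }

        // Issue (c): the run-state is overwritten unconditionally.
        void reportProgress(TaskInProgress tip) { tip.runState = State.RUNNING; }

        // Issue (b): cleanup is what drops the task from 'tasks'.
        void cleanup(TaskInProgress tip) { tasks.remove(tip.id); }

        // Assumption: only tasks that are no longer RUNNING get swept out.
        void heartbeatSweep() {
            runningTasks.values().removeIf(t -> t.runState != State.RUNNING);
        }
    }

    /** Replays kill -> reportProgress -> cleanup and reports whether the task lingers. */
    public static boolean taskLeaks() {
        Tracker tt = new Tracker();
        TaskInProgress tip = tt.launch("task_200710090910_0003_r_001792_1");
        tt.kill(tip);           // purge begins: state becomes KILLED
        tt.reportProgress(tip); // child update slips in: state back to RUNNING
        tt.cleanup(tip);        // task removed from 'tasks'
        tt.heartbeatSweep();    // sees RUNNING, so the entry is kept
        return !tt.tasks.containsKey(tip.id) && tt.runningTasks.containsKey(tip.id);
    }

    public static void main(String[] args) {
        System.out.println("killed task still in runningTasks: " + taskLeaks());
    }
}
```

In this model the task ends up gone from {{tasks}} but still present in {{runningTasks}} with state {{RUNNING}}, which would match the "unknown child task" warnings in lines #5 and #6.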

There are two ways to work around this:
a) Call {{tasks.remove(taskid)}} from {{TaskTracker#TaskInProgress#kill}}, so that the interleaved
call to {{TaskTracker#TaskInProgress#reportProgress}} can no longer wrongly update the task status
to {{RUNNING}}.
b) Check that the task's state is actually {{RUNNING}} before updating its status when
the child reports in.

I'd go with (b).
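
A rough sketch of what (b) could look like follows. The class and its members are hypothetical and heavily simplified, not the actual {{TaskTracker#TaskInProgress}} code; the boolean return value is added here purely for illustration.

```java
/** Simplified illustration of option (b); NOT the real TaskTracker code. */
public class GuardedProgressSketch {
    public enum State { RUNNING, KILLED }

    public static class TaskInProgress {
        private State runState = State.RUNNING;
        private float progress = 0.0f;

        public synchronized void kill() {
            runState = State.KILLED;
        }

        /**
         * Accepts a progress update only while the task is still RUNNING.
         * Returns false when a late update from the child is ignored.
         */
        public synchronized boolean reportProgress(float newProgress) {
            if (runState != State.RUNNING) {
                return false; // do not resurrect a KILLED task
            }
            progress = newProgress;
            return true;
        }

        public synchronized State getRunState() { return runState; }

        public synchronized float getProgress() { return progress; }
    }

    public static void main(String[] args) {
        TaskInProgress tip = new TaskInProgress();
        tip.reportProgress(0.5f);
        tip.kill();
        boolean accepted = tip.reportProgress(0.75f);
        System.out.println("late update accepted: " + accepted
            + ", state: " + tip.getRunState());
    }
}
```

With this guard, a progress report that arrives after {{kill}} leaves the {{KILLED}} state intact, so the task can be dropped as usual regardless of how the calls interleave.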

> Race condition in removing a KILLED task from tasktracker
> ---------------------------------------------------------
>                 Key: HADOOP-2016
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2016
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.15.0
> I ran into a situation where a speculative task was killed by the JobTracker and the
> relevant TaskTracker got the right KillTaskAction, but the tasktracker continued to hold a
> reference to that task (although the task JVM was killed). The task remained in the RUNNING
> state in both the JobTracker and that TaskTracker forever. I suspect there is a race condition
> in reading/updating data structures inside the taskCleanupThread & transmitHeartBeat.
