hadoop-mapreduce-issues mailing list archives

From "Vinod K V (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-733) When running ant test TestTrackerBlacklistAcrossJobs, losing task tracker heartbeat exception occurs.
Date Wed, 08 Jul 2009 13:21:14 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728684#action_12728684 ]

Vinod K V commented on MAPREDUCE-733:
-------------------------------------

I just looked at the code that causes this. The exception is thrown whenever there is an
attempt to unreserve a job's slots on a TaskTracker whose current reservation is for a
different job. Handling for this case should have been part of MAPREDUCE-516 itself, but it
was unfortunately missed
(https://issues.apache.org/jira/browse/MAPREDUCE-516?focusedCommentId=12721792&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12721792).
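To make the failure mode concrete, here is a minimal sketch of the kind of guard that seems to be missing. It is not the actual Hadoop code or the eventual patch: the class TrackerReservationSketch, the helper unreserveIfHeldBy, and the single-String reservation model are all hypothetical simplifications. Only the name unreserveSlots and the wording of the exception message come from the stack trace quoted below.

```java
/** Hypothetical, simplified stand-in for the JobTracker-side view of one TaskTracker. */
class TrackerReservationSketch {
    // Job that currently holds the reservation; null means nothing is reserved.
    private String reservedJobId;

    void reserveSlots(String jobId) {
        if (reservedJobId != null && !reservedJobId.equals(jobId)) {
            throw new IllegalStateException("already has slots reserved for " + reservedJobId);
        }
        reservedJobId = jobId;
    }

    /** Strict version: throws when the reservation is not held by jobId. */
    void unreserveSlots(String jobId) {
        if (jobId == null || !jobId.equals(reservedJobId)) {
            throw new RuntimeException("already has slots reserved for " + reservedJobId
                + "; being asked to un-reserve for " + jobId);
        }
        reservedJobId = null;
    }

    /**
     * Assumed caller-side guard: release the reservation only if this job
     * holds it. Returns true if a reservation was actually released.
     */
    boolean unreserveIfHeldBy(String jobId) {
        if (jobId == null || !jobId.equals(reservedJobId)) {
            return false; // reserved for another job, or not reserved at all
        }
        unreserveSlots(jobId);
        return true;
    }

    String getReservedJobId() {
        return reservedJobId;
    }
}
```

With a check like this on the task-failure path, a failing task of one job would leave another job's reservation (or an empty reservation) untouched instead of throwing in the middle of heartbeat processing.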

The resulting behavior is that when a task fails, one heartbeat of the TT is lost, but the
next heartbeat goes through. This is because the first heartbeat marks the task as FAILED
on the JobTracker, so the faulty code is not invoked again for the same TT in later heartbeats.
This leaves inconsistent state on the JT. For example, the code that creates the task
completion event immediately follows the faulty call, so that event is never created for
this task. This issue HAS to be fixed immediately because of these side effects.
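The sequence above can be modelled with a small sketch. Everything in it is an assumption made for illustration: the class HeartbeatSketch, its fields, and the three-step ordering are a toy model of the behavior described in this comment, not the real JobTracker code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of JobTracker-side heartbeat processing. */
class HeartbeatSketch {
    final List<String> failedTasks = new ArrayList<>();
    final List<String> completionEvents = new ArrayList<>();
    private final boolean faultyUnreserve;

    HeartbeatSketch(boolean faultyUnreserve) {
        this.faultyUnreserve = faultyUnreserve;
    }

    /** Returns true if the heartbeat was processed, false if it was lost to an exception. */
    boolean heartbeat(String taskId, boolean reportedFailed) {
        try {
            if (reportedFailed && !failedTasks.contains(taskId)) {
                failedTasks.add(taskId);      // step 1: task is marked FAILED
                if (faultyUnreserve) {        // step 2: the faulty un-reserve call throws
                    throw new RuntimeException("being asked to un-reserve for another job");
                }
                completionEvents.add(taskId); // step 3: skipped whenever step 2 throws
            }
            return true;
        } catch (RuntimeException e) {
            return false; // the tracker sees an error and resends its status
        }
    }
}
```

In this model the first heartbeat reporting the failure returns false, the resent heartbeat returns true because the task is already marked FAILED, and completionEvents stays empty for that task, which is the inconsistency described above.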

One more thing I observed while going through this: reservations are not removed from a
TaskTracker that is globally blacklisted, whether because of a large task-failure count or
because of an unhealthy status.

> When running ant test TestTrackerBlacklistAcrossJobs, losing task tracker heartbeat exception occurs.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-733
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-733
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>            Reporter: Iyappan Srinivasan
>
> When running ant test TestTrackerBlacklistAcrossJobs, losing task tracker heartbeat.

> It seems when a task tracker is killed, it throws exception. Instead it should catch it and process it and allow the rest of the flow to go through.
> 2009-07-08 11:58:26,116 INFO  ipc.Server (Server.java:run(973)) - IPC Server handler 7 on 40193, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@13ec758, false, false, true, 6) from 127.0.0.1:40200: error: java.io.IOException: java.lang.RuntimeException: tracker_host1.rack.com:localhost/127.0.0.1:40197 already has slots reserved for null; being asked to un-reserve for job_200907081158_0001
> java.io.IOException: java.lang.RuntimeException: tracker_host1.rack.com:localhost/127.0.0.1:40197 already has slots reserved for null; being asked to un-reserve for job_200907081158_0001
>         at org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.unreserveSlots(TaskTracker.java:162)
>         at org.apache.hadoop.mapred.JobInProgress.addTrackerTaskFailure(JobInProgress.java:1580)
>         at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:2908)
>         at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1025)
>         at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3869)
>         at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3081)
>         at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2819)
>         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:960)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:958)
> 2009-07-08 11:58:26,162 INFO  mapred.TaskTracker (TaskTracker.java:transmitHeartBeat(1196)) - Resending 'status' to 'localhost' with reponseId '6

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

