hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1374) TaskTracker falls into an infinite loop.
Date Fri, 25 May 2007 17:31:16 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499174 ]

Arun C Murthy commented on HADOOP-1374:
---------------------------------------

Konstantin, as per your attached logs, one of your task-trackers was 'lost' (it takes 10 mins
to declare a tracker 'lost'); its tasks were rescheduled to the other tracker and your job
completed fine (as per the jobtracker logs)...

Ok, I've racked my brains on this one and let me try and explain what I think is happening
and potentially one short-term fix to ease our lives... fasten your seat-belts please:

a) MapTask completes and we see the 'done' message from {{TaskTracker.reportDone}}.
b) However {{TaskTracker.reportDone}} only notes that the task is *done* by setting a boolean
(but *does not* mark the {{TaskInProgress.runstate}} as {{SUCCEEDED}}).
c) The child JVM, for whatever reason (maybe a Windows peculiarity), doesn't 'exit' (might
be due to stray non-daemon threads etc.). Thus {{TaskRunner.runChild}}'s {{process.waitFor}}
is hung, and hence {{TaskRunner.run}} cannot call {{TaskTracker.reportTaskFinished}}, which
is the only place that sets {{TaskInProgress.runstate}} to {{SUCCEEDED}}.
d) *10 mins* later {{TaskTracker.markUnresponsiveTasks}} marks this task as 'unresponsive'
and kills it. However, this might be too late, since by then the junit test case has (possibly)
been killed for over-running its 15-min limit, and we have a failed test case.
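
To make the sequence in (a)-(d) concrete, here is a toy model in Java. It is only a sketch and *not* the actual TaskTracker code: the method names mirror the ones mentioned above, but the bodies, the {{simulate}} helper, the 1-second tick and the "negative exit time means the child never exits" convention are all made up for illustration.

```java
class StuckChildSketch {
    enum RunState { RUNNING, SUCCEEDED, KILLED }

    /** Toy stand-in for TaskInProgress; fields are illustrative only. */
    static class TaskInProgress {
        boolean done = false;                  // set by reportDone, step (b)
        RunState runstate = RunState.RUNNING;  // only changed in steps (c) and (d)
        long lastProgressMillis = 0;

        // (a)/(b): the child says it is done; only the boolean is noted.
        void reportDone(long now) {
            done = true;
            lastProgressMillis = now;
        }

        // (c): reachable only once the child JVM has exited (waitFor returned).
        void reportTaskFinished() {
            if (runstate == RunState.RUNNING && done) {
                runstate = RunState.SUCCEEDED;
            }
        }

        // (d): kill the task if it has been silent for longer than the timeout.
        void markUnresponsive(long now, long timeoutMillis) {
            if (runstate == RunState.RUNNING
                    && now - lastProgressMillis > timeoutMillis) {
                runstate = RunState.KILLED;
            }
        }
    }

    /**
     * Returns the runstate observed at checkAtMillis.
     * A negative childExitMillis models a child JVM that never exits.
     */
    static RunState simulate(long childExitMillis, long timeoutMillis,
                             long checkAtMillis) {
        TaskInProgress tip = new TaskInProgress();
        tip.reportDone(0);
        for (long now = 0; now <= checkAtMillis; now += 1000) {
            if (childExitMillis >= 0 && now >= childExitMillis) {
                tip.reportTaskFinished();
            }
            tip.markUnresponsive(now, timeoutMillis);
        }
        return tip.runstate;
    }

    public static void main(String[] args) {
        long min = 60_000L;
        System.out.println(simulate(2_000, 10 * min, 1 * min));  // child exits promptly
        System.out.println(simulate(-1, 10 * min, 9 * min));     // hung child, before the timeout
        System.out.println(simulate(-1, 10 * min, 11 * min));    // hung child, after the timeout
    }
}
```

In this model a child that exits promptly ends up {{SUCCEEDED}}, while a hung child stays {{RUNNING}} (even though it reported 'done') until the timeout passes and it is {{KILLED}}.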

Phew! Hope that makes sense. It looks like we might have to figure out why the child JVM isn't
exiting in the first place; other than that, I don't see a bug so far, IMO.

One option is to reduce those timeouts from 10 mins to 3-5 mins for the test cases; things
should then swim along fine for now, while we continue to try and figure this one out for 0.14.0
or 0.13.1 if possible... does that sound reasonable? Nigel?
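
A back-of-the-envelope check of why a shorter timeout should help, as a small Java sketch. The 10-min timeout and the 15-min junit limit come from the explanation above; the minutes the test spends before the hang and after the kill are placeholders I made up, and {{fitsInLimit}} is a hypothetical helper, not anything in Hadoop.

```java
class TimeoutBudgetSketch {
    /**
     * True if the time before the hang, plus the wait for the hung task to be
     * killed, plus the remaining work, still fits within the junit limit.
     * All values are in minutes.
     */
    static boolean fitsInLimit(long beforeHangMin, long taskTimeoutMin,
                               long afterKillMin, long junitLimitMin) {
        return beforeHangMin + taskTimeoutMin + afterKillMin <= junitLimitMin;
    }

    public static void main(String[] args) {
        long beforeHang = 3;   // assumed, not measured
        long afterKill = 3;    // assumed, not measured
        long junitLimit = 15;  // from the discussion above

        for (long timeout : new long[] {10, 5, 3}) {
            System.out.println("timeout=" + timeout + "min fits="
                    + fitsInLimit(beforeHang, timeout, afterKill, junitLimit));
        }
    }
}
```

With these made-up numbers the 10-min timeout overshoots the limit (3 + 10 + 3 = 16 mins) while 5 or 3 mins leaves room to spare; the real margins depend on how long the test case actually runs.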




> TaskTracker falls into an infinite loop.
> ----------------------------------------
>
>                 Key: HADOOP-1374
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1374
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Konstantin Shvachko
>         Assigned To: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.13.0
>
>         Attachments: DataNode1.log, DataNode2.log, JobTracker.log, NameNode.log, TaskTracker1.log, TaskTracker2.log, TestDFSIO.log
>
>
> All maps had been completed successfully. I had only one reduce task during which
> TaskTracker infinitely outputs:
> 07/05/15 19:35:41 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667% reduce > copy (4 of 8 at 0.00 MB/s) >
> 07/05/15 19:35:42 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667% reduce > copy (4 of 8 at 0.00 MB/s) >
> 07/05/15 19:35:43 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667% reduce > copy (4 of 8 at 0.00 MB/s) >
> 07/05/15 19:35:44 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667% reduce > copy (4 of 8 at 0.00 MB/s) >
> 07/05/15 19:35:45 INFO mapred.TaskTracker: task_0001_r_000000_0 0.16666667% reduce > copy (4 of 8 at 0.00 MB/s) >
> JobTracker does not log anything about task task_0001_r_000000_0 except for
> 07/05/15 19:49:01 INFO mapred.JobTracker: Adding task 'task_0001_r_000000_0' to tip tip_0001_r_000000, for tracker 'tracker_my-host.com:50050'

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.