hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4716) testRestartWithLostTracker frequently times out
Date Thu, 27 Nov 2008 13:54:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651357#action_12651357 ]

Amar Kamat commented on HADOOP-4716:
------------------------------------

Upon restart, the JobTracker rebuilds the _task-completion-event_ list. This list includes events
from the tracker that was lost during the restart. When a task-tracker (re)connects, it re-sizes
its own _task-completion-event_ list. Hence the tracker retains the missing maps' events.
After some time the JobTracker finds out that the tracker is lost, kills all the maps that
were run on the lost tracker, and re-executes them. The tracker will then have a
_task-completion-event_ list like this:
{code}
1. SUC m1-t1
2. SUC m2-t2
3. SUC m3-t1
4. SUC m4-t2
5. KIL m1-t1
6. KIL m3-t1
7. SUC m1-t2
8. SUC m3-t2
{code}
The reducer takes _m1-t1_ and starts pulling map output from _t1_. Note that when the reducer
fails on _m1_, it sees that _m1_ is _OBSOLETE_ and then ignores it. The test case times out
because a single failure takes a fair amount of time (~3 min). So this doesn't look like a bug
but a limitation. The issue is not commonly seen because the reducer typically starts late,
by which time the tracker has the latest updates, and this prevents the reducer from taking
up maps from the lost tracker. I could easily reproduce the problem when the reducer was
scheduled early.
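
To make the sequence concrete, here is a minimal Java sketch that replays the event list shown above. It is only a toy model: the names _CompletionEventReplay_, _Event_ and _fetchSources_ are hypothetical and are not Hadoop classes. It assumes a reducer applies the events in order, treats a SUC event as "fetch this map from that tracker", and treats a KIL event as marking that attempt obsolete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the event list from this comment; NOT Hadoop's real classes.
class CompletionEventReplay {
    enum Status { SUCCEEDED, KILLED }

    // One entry of the task-completion-event list, e.g. "SUC m1-t1".
    static final class Event {
        final Status status;
        final String map;      // map id, e.g. "m1"
        final String tracker;  // tracker that ran the attempt, e.g. "t1"

        Event(Status status, String map, String tracker) {
            this.status = status;
            this.map = map;
            this.tracker = tracker;
        }
    }

    // The eight events listed in the comment, in the same order.
    static List<Event> exampleEvents() {
        List<Event> events = new ArrayList<>();
        events.add(new Event(Status.SUCCEEDED, "m1", "t1"));
        events.add(new Event(Status.SUCCEEDED, "m2", "t2"));
        events.add(new Event(Status.SUCCEEDED, "m3", "t1"));
        events.add(new Event(Status.SUCCEEDED, "m4", "t2"));
        events.add(new Event(Status.KILLED, "m1", "t1"));
        events.add(new Event(Status.KILLED, "m3", "t1"));
        events.add(new Event(Status.SUCCEEDED, "m1", "t2"));
        events.add(new Event(Status.SUCCEEDED, "m3", "t2"));
        return events;
    }

    // Returns map id -> tracker to fetch from, as seen by a reducer that
    // has consumed only the first `seen` events of the list.
    static Map<String, String> fetchSources(List<Event> events, int seen) {
        Map<String, String> sources = new TreeMap<>();
        for (Event e : events.subList(0, seen)) {
            if (e.status == Status.SUCCEEDED) {
                sources.put(e.map, e.tracker);
            } else if (e.tracker.equals(sources.get(e.map))) {
                // The attempt was killed: its output is obsolete.
                sources.remove(e.map);
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        List<Event> events = exampleEvents();
        // A reducer scheduled early has seen only the first four events.
        System.out.println("early reducer: " + fetchSources(events, 4));
        // A reducer scheduled late has seen the whole list.
        System.out.println("late reducer:  " + fetchSources(events, events.size()));
    }
}
```

In this model the early reducer still maps _m1_ and _m3_ to the lost tracker _t1_, which is the case that costs ~3 min per failure, while the late reducer maps every map to _t2_.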
----
One thing that can be done here is to set _num-reducers=0_, as the test case doesn't actually
require reducers. But it's actually better to keep the reducers, as they make the test case
stricter and hence better. So if we decide to keep the reducers, there should be some way to
control the timeout (~3 min --> ~5 sec). Thoughts?

> testRestartWithLostTracker frequently times out
> -----------------------------------------------
>
>                 Key: HADOOP-4716
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4716
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Johan Oskarsson
>            Assignee: Amar Kamat
>            Priority: Minor
>             Fix For: 0.20.0
>
>         Attachments: log.txt
>
>
> This test frequently times out: org.apache.hadoop.mapred.TestJobTrackerRestartWithLostTracker.testRestartWithLostTracker
> Example: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3637/testReport/org.apache.hadoop.mapred/TestJobTrackerRestartWithLostTracker/testRestartWithLostTracker/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

