hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3245) Provide ability to persist running jobs (extend HADOOP-1876)
Date Fri, 27 Jun 2008 19:37:45 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Amar Kamat updated HADOOP-3245:
-------------------------------

    Attachment: HADOOP-3245-v2.6.5.patch

Attaching a patch for review. Following are the changes:
1) The bug discussed [here|https://issues.apache.org/jira/browse/HADOOP-3245?focusedCommentId=12604620#action_12604620] is taken care of. The reducer, on resetting, won't skip any task completion events. Duplicate events for a TIP from different attempts will also be added; the reduce task seems to take care of de-duplicating them.
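For illustration, this is roughly the kind of de-duplication the reduce side can do (a sketch only; the helper names are hypothetical and not from the patch):
{code:java}
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.mapred.TaskCompletionEvent;

// Sketch: remember which TIPs have already had their map output fetched, so a
// duplicate SUCCEEDED event for the same TIP (from a different attempt) is ignored.
class CompletionEventFilter {
  private final Set<String> copiedTips = new HashSet<String>();

  void process(TaskCompletionEvent event) {
    String tipId = tipIdOf(event.getTaskId());   // attempt id minus the trailing attempt number
    if (event.getTaskStatus() == TaskCompletionEvent.Status.SUCCEEDED
        && copiedTips.add(tipId)) {
      scheduleFetch(event);                      // hypothetical: hand the map output to the copier threads
    }
  }

  private String tipIdOf(String attemptId) {
    return attemptId.substring(0, attemptId.lastIndexOf('_'));
  }

  private void scheduleFetch(TaskCompletionEvent event) { /* queue for copying */ }
}
{code}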

2) The job directories for restored jobs are checked for completeness before being added to the queue. A job directory is considered complete when it contains _job.xml_, _job.jar_ and _job.split_.
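Something along these lines, as a minimal sketch (class and method names hypothetical):
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: a restored job directory is only queued for recovery if all three
// files written at submission time are present.
class JobDirCheck {
  private static final String[] REQUIRED = {"job.xml", "job.jar", "job.split"};

  static boolean isComplete(FileSystem fs, Path jobDir) throws IOException {
    for (String name : REQUIRED) {
      if (!fs.exists(new Path(jobDir, name))) {
        return false;   // partial submission; skip this job on restart
      }
    }
    return true;
  }
}
{code}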

3) There was one corner case where the jobtracker dies having marked a job as completed (its job directory is gone) but before communicating that to the tasktrackers, i.e. the tasktrackers still hold task statuses for the completed job. This is handled as follows: on receiving an update request for a missing job, the jobtracker will ask all the TTs to clear that job's details.
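A minimal sketch of that decision, with hypothetical names (in the real jobtracker this would presumably go back to the tracker as a kill-job directive in the heartbeat response):
{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: when a status update mentions a job the restarted JT has no
// record of (its job directory was gone), tell the tracker to purge that
// job's local state instead of processing the reports.
class StaleJobHandler {
  private final Set<String> knownJobs = new HashSet<String>();   // jobs recovered from disk

  /** Returns the ids of jobs the reporting tracker should forget about. */
  List<String> jobsToPurge(List<String> reportedJobIds) {
    List<String> purge = new ArrayList<String>();
    for (String jobId : reportedJobIds) {
      if (!knownJobs.contains(jobId)) {
        purge.add(jobId);   // JT never recovered this job; TT must clear its details
      }
    }
    return purge;
  }
}
{code}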

4) Restart mode turned off: The restart mode is turned off after some time, since we don't want the JT to entertain latecomers indefinitely. The JT comes out of restart mode once the following condition holds:
{{current-time > last-time-when-a-tt-synced + lost-task-tracker-interval}}
This should make sure that we don't close registration too early.
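In code form the check is essentially this (a sketch; names are made up):
{code:java}
// Sketch of the condition above: recovery ends once no tracker has re-synced
// for a full lost-tracker interval, so stragglers are no longer waited for.
class RestartMode {
  static boolean shouldLeaveRestartMode(long lastTrackerSyncTime, long lostTrackerInterval) {
    return System.currentTimeMillis() > lastTrackerSyncTime + lostTrackerInterval;
  }
}
{code}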

5) The web UI now shows the restart information: whether the JT is still recovering and how long the recovery has taken.
----
Issues taken care of:
1) Consider the following case: reducers belonging to the old JT are still shuffling a map m while the JT gets restarted. m gets re-executed on a different host, say m'. Suppose m' checks in before m; since m checks in later, it gets killed. The reducers that fetch from m now start failing. Here the fetch-failure notifications will have no effect on the JT, and hence there are no false notifications.
2) Blacklisting of a tracker per job is based on the task failures on that host. Failed statuses are not cleared from the running jobs on the tracker and hence will be replayed as per the design.
3) If a TIP has failed earlier, it will fail again since all the failed task statuses will
be replayed.
----
Known issues:
1) I have seen jobs getting stuck. I tried hard to reproduce it but I couldn't. Will keep
testing the patch.
2) The job runtime will change, as the runtime is calculated from the time the job is created at the jobtracker. With a restarted jobtracker the old start time is lost.
3) The task attempt id is now changed: it incorporates the jobtracker's start time and hence might affect the task output filters. Also, applications outside the framework would not be able to _guess_ the attempt id, which they should not be doing anyway.
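To illustrate why (the format and values here are assumed for illustration, not taken from the patch), an attempt id is built from the JT start time, so a JT restarted at a different time mints ids that cannot be predicted from the job alone:
{code:java}
// Illustrative sketch only: the leading timestamp component comes from the
// jobtracker's start time, so it changes whenever the JT is restarted.
class AttemptIdSketch {
  static String attemptId(String jtStartTime, int jobNum, boolean isMap, int taskNum, int attempt) {
    return String.format("attempt_%s_%04d_%s_%06d_%d",
        jtStartTime, jobNum, isMap ? "m" : "r", taskNum, attempt);
  }
}
{code}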


> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
>                 Key: HADOOP-3245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3245
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>         Attachments: HADOOP-3245-v2.5.patch, HADOOP-3245-v2.6.5.patch
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be applied
> for things like jobs being able to survive jobtracker restarts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

