hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3245) Provide ability to persist running jobs (extend HADOOP-1876)
Date Fri, 08 Aug 2008 11:45:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Amar Kamat updated HADOOP-3245:
-------------------------------

    Attachment: HADOOP-3245-v5.14.patch

Had an offline discussion with Devaraj and Hemanth. Attaching a patch that incorporates their
comments, which are as follows:
1) Avoid safe mode by delaying the start of ipc server. The ipc server starts only after the
JobTracker has recovered. This avoids any extra coding at the tasktracker side.
2) Avoid having any registration window, as TrackerExpiryThread will take care of the trackers
that were lost while the JobTracker was down.
3) Remove unnecessary changes done to the Task/TaskAttempt logging with respect to passing
of counters
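
To illustrate point 1, here is a minimal sketch of the startup ordering, assuming a simplified
JobTracker. The class and method names (DelayedIpcStartup, recoverJobs, startIpcServer) are
hypothetical and are not taken from the patch; the only idea shown is that the ipc server
refuses to start until recovery has finished, so trackers and clients never see a
half-recovered JobTracker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the JobTracker startup sequence; not Hadoop's actual classes.
class DelayedIpcStartup {
    // Records the order in which startup steps ran, for inspection.
    final List<String> events = new ArrayList<>();
    private boolean recovered = false;
    private boolean ipcRunning = false;

    // Placeholder for replaying persisted job state (assumed to be synchronous).
    void recoverJobs() {
        events.add("recover");
        recovered = true;
    }

    // The ipc server may only start once recovery is complete.
    void startIpcServer() {
        if (!recovered) {
            throw new IllegalStateException("ipc server must not start before recovery");
        }
        events.add("ipc-start");
        ipcRunning = true;
    }

    // Startup order: recover first, then accept connections.
    void start() {
        recoverJobs();
        startIpcServer();
    }

    boolean isIpcRunning() {
        return ipcRunning;
    }

    public static void main(String[] args) {
        DelayedIpcStartup jt = new DelayedIpcStartup();
        jt.start();
        System.out.println(jt.events); // prints [recover, ipc-start]
    }
}
```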
----
Things taken care of from the todo list
1) Refactored the recovery-related code into RecoveryManager
----
Things that need more work/discussion
1) Is safe mode required? Whether we want to start the ipc server early is the question we
need to answer. Starting it early will allow JobClient and TaskTracker to connect to the JobTracker.
It's the JobTracker's responsibility to handle the connection. It could either throw an exception
or reply with a _dummy_ response. Apart from the fact that the JobClient can now detect
that the JT is up but under maintenance and take some specific actions, there seems to be no reason
to have the ipc services running before recovery (i.e. to have the safe mode)
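
For comparison, the safe-mode alternative discussed in point 1 could look roughly like the
sketch below: the ipc server is up early, and the JobTracker rejects calls with an exception
until recovery completes. SafeModeSketch, SafeModeException, heartbeat and finishRecovery are
hypothetical names used only for illustration; they are not part of the patch or of Hadoop's
API.

```java
// Hypothetical sketch of a JobTracker that accepts connections while still recovering.
class SafeModeSketch {
    // Hypothetical exception a client could catch to learn the JT is up but under maintenance.
    static class SafeModeException extends RuntimeException {
        SafeModeException(String message) {
            super(message);
        }
    }

    // True until recovery finishes; volatile because handlers may run on other threads.
    private volatile boolean recovering = true;

    void finishRecovery() {
        recovering = false;
    }

    // Stand-in for any ipc call from a TaskTracker or JobClient.
    String heartbeat(String trackerName) {
        if (recovering) {
            throw new SafeModeException("JobTracker is recovering; retry later");
        }
        return "ack:" + trackerName;
    }

    public static void main(String[] args) {
        SafeModeSketch jt = new SafeModeSketch();
        try {
            jt.heartbeat("tt1");
        } catch (SafeModeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        jt.finishRecovery();
        System.out.println(jt.heartbeat("tt1"));
    }
}
```

The cost is visible even in this small form: every call handler needs the extra check, and
callers need logic to handle the rejection.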
2) W.r.t point #7 in my earlier comment ([here|https://issues.apache.org/jira/browse/HADOOP-3245?focusedCommentId=12620042#action_12620042])
it seems that the time to detect the previously killed tasks will depend on 
2.1) number of reducers
2.2) the reducers' ability to report back the fetch failures
It seems we can do better by asking each tracker for the list of maps it currently
hosts, i.e. the tasks on that tracker that completed successfully. The JobTracker
can then kill all the tasks that were not claimed. We feel this can be dealt with in a separate
issue.
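
A rough sketch of that idea, under the assumption that each tracker can report the set of map
attempts whose output it still hosts. The class name, method name and task ids below are made
up for illustration; nothing here reflects an existing Hadoop interface.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical reconciliation step run by the JobTracker after a restart.
class UnclaimedTaskSketch {
    // Returns the maps the JobTracker believes succeeded but that no tracker claimed.
    static Set<String> unclaimed(Set<String> believedSuccessful,
                                 Collection<Set<String>> trackerReports) {
        Set<String> result = new TreeSet<>(believedSuccessful);
        for (Set<String> report : trackerReports) {
            result.removeAll(report); // drop every map some tracker still hosts
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> believed = new HashSet<>(Arrays.asList("m_000000", "m_000001", "m_000002"));
        Set<String> tt1 = new HashSet<>(Arrays.asList("m_000000"));
        Set<String> tt2 = new HashSet<>(Arrays.asList("m_000002"));
        // m_000001 was claimed by no tracker, so it is the one to kill and re-run.
        System.out.println(unclaimed(believed, Arrays.asList(tt1, tt2))); // prints [m_000001]
    }
}
```

With such a report the detection time would no longer depend on the number of reducers or on
fetch-failure notifications.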
---- 
I am currently testing the patch on a larger cluster.

> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
>                 Key: HADOOP-3245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3245
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>         Attachments: HADOOP-3245-v2.5.patch, HADOOP-3245-v2.6.5.patch, HADOOP-3245-v2.6.9.patch,
HADOOP-3245-v4.1.patch, HADOOP-3245-v5.13.patch, HADOOP-3245-v5.14.patch
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be applied
for things like jobs being able to survive jobtracker restarts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

