hadoop-mapreduce-issues mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-828) Provide a mechanism to pause the jobtracker
Date Thu, 06 Aug 2009 06:35:14 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739914#action_12739914 ]

Hemanth Yamijala commented on MAPREDUCE-828:

Some initial thoughts on implementation:
- From the time the pause command is issued, the JT will answer heartbeats by sending back
a special response to the TTs indicating that it is paused.
- This special response will cause the TTs to replay their message, as if the original message
had not been received by the JT.
- The above is similar to what happens today if TTs fail to communicate with the JT.
- The JT will not process any data sent by the TTs while it is paused. In other words, job
state will not change. Since this data will be replayed until the JT is resumed, it will not
be lost and can be picked up once the JT is resumed.
- The ExpireLaunchingTasks thread will pause as well, since the status of tasks last launched
before the pause will not be updated.
- The CleanupQueue thread on the JT, which deletes files from the DFS, will also be paused,
since its DFS deletes might fail.
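
To make the heartbeat handling above concrete, here is a minimal Java sketch. It is not JobTracker code: the class PausableTrackerSketch, the Action enum, TrackerClient and all method names are hypothetical and chosen only for illustration. It assumes a TT keeps its unacknowledged task statuses and resends them on every heartbeat until the JT reports that it processed them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the proposal; none of these names are Hadoop APIs.
class PausableTrackerSketch {

    /** What the JT tells a TT in reply to a heartbeat. */
    enum Action { PROCESSED, RESEND_LAST_HEARTBEAT }

    private boolean paused = false;
    private final List<String> appliedStatuses = new ArrayList<>();

    synchronized void pause()  { paused = true; }
    synchronized void resume() { paused = false; }

    /** JT side: while paused, ignore the payload and ask the TT to replay it. */
    synchronized Action heartbeat(String trackerName, List<String> taskStatuses) {
        if (paused) {
            return Action.RESEND_LAST_HEARTBEAT; // job state is left untouched
        }
        for (String status : taskStatuses) {
            appliedStatuses.add(trackerName + ":" + status);
        }
        return Action.PROCESSED;
    }

    /** Snapshot of the status updates the JT has applied so far. */
    synchronized List<String> appliedStatuses() {
        return new ArrayList<>(appliedStatuses);
    }

    /** TT side: keep unacknowledged statuses and resend them on every heartbeat. */
    static final class TrackerClient {
        private final String name;
        private final Deque<String> pending = new ArrayDeque<>();

        TrackerClient(String name) { this.name = name; }

        void report(String taskStatus) { pending.addLast(taskStatus); }

        Action sendHeartbeat(PausableTrackerSketch jt) {
            Action reply = jt.heartbeat(name, new ArrayList<>(pending));
            if (reply == Action.PROCESSED) {
                pending.clear(); // only drop data the JT actually consumed
            }
            return reply;
        }

        int pendingCount() { return pending.size(); }
    }

    public static void main(String[] args) {
        PausableTrackerSketch jt = new PausableTrackerSketch();
        TrackerClient tt = new TrackerClient("tt1");

        jt.pause();
        tt.report("task_1 SUCCEEDED");
        System.out.println(tt.sendHeartbeat(jt)); // JT is paused, nothing applied
        tt.report("task_2 RUNNING");
        System.out.println(tt.sendHeartbeat(jt)); // still paused, both statuses pending

        jt.resume();
        System.out.println(tt.sendHeartbeat(jt)); // replayed statuses are applied now
        System.out.println(jt.appliedStatuses());
    }
}
```

The sketch leaves out the ExpireLaunchingTasks and CleanupQueue threads; under the same assumptions they would check the paused flag before doing any work.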

Thoughts on the requirements / proposal?

> Provide a mechanism to pause the jobtracker
> -------------------------------------------
>                 Key: MAPREDUCE-828
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-828
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: jobtracker
>            Reporter: Hemanth Yamijala
> We've seen scenarios where we needed to stop the namenode for a maintenance activity.
> In such scenarios, if the jobtracker (JT) continues to run, jobs would fail due to initialization
> or task failures (due to DFS). We could restart the JT with job recovery enabled in such
> scenarios. But a restart has proved to be a very intrusive activity, particularly if the JT
> is not at fault itself and does not require a restart. The ask is for an admin-controlled feature
> to pause the JT, which would take it to a state somewhat analogous to the safe mode of DFS.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
