hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3245) Provide ability to persist running jobs (extend HADOOP-1876)
Date Tue, 10 Jun 2008 05:07:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603772#action_12603772 ]

Amar Kamat commented on HADOOP-3245:

There is a deficiency in the _SYNC_ algo mentioned above. After a JT restart, the _task completion
events_ in the JT will be in a different order, so the copy of the _task completion events_ that a
TT holds no longer matches the JT's list. Consider the following cases with the _SYNC_ algo in place:

1) A TT has reducers in the _SHUFFLE_ phase: here the TT has only a partial copy of the previous
version of the _task completion events_ and hence requires the latest copy.

2) A TT has all its reducers in the _REDUCE_ phase before the JT restart and is assigned new reduce
tasks after the restart: here the TT has a complete copy of the previous version of the _task
completion events_, so it looks as if it might not require the latest copy. But consider a case
where some maps were lost and hence their output is not available. Since the ordering of the
_task completion events_ is different and some events in the JT might belong to re-executed maps,
there is no good way to inform the TTs that a particular map was lost and that they should use
the new _task completion event_.

Currently (with trunk) this is not an issue because there is just one copy of the _task completion
events_ in the lifetime of the job. Hence the TT always has the correct ordering of the
_completion events_, and in case of re-executions the new event is always at the end.

One simple solution would be to force a rebuild of the TT's copy of the _completion events_ on the
_SYNC_ action.
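
The problem and the proposed fix can be illustrated with a small, self-contained sketch. This is not the Hadoop code: the class names (`Event`, `JobTrackerLog`, `TaskTrackerCopy`) and methods (`fetchFrom`, `poll`, `onSync`) are hypothetical, and the sketch assumes the TT fetches events incrementally by index, so that its position in the list is only meaningful while the JT's ordering stays fixed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompletionEventSync {

    /** Hypothetical stand-in for a task completion event: a map id and the attempt that produced its output. */
    static final class Event {
        final int mapId;
        final int attempt;
        Event(int mapId, int attempt) { this.mapId = mapId; this.attempt = attempt; }
    }

    /** JT side (assumed): an ordered event list that TTs read starting from an index. */
    static final class JobTrackerLog {
        private final List<Event> events = new ArrayList<>();

        JobTrackerLog add(int mapId, int attempt) {
            events.add(new Event(mapId, attempt));
            return this;
        }

        /** Returns a copy of all events at position fromIndex or later. */
        List<Event> fetchFrom(int fromIndex) {
            int start = Math.min(fromIndex, events.size());
            return new ArrayList<>(events.subList(start, events.size()));
        }
    }

    /** TT side (assumed): a local copy whose size doubles as the fetch cursor. */
    static final class TaskTrackerCopy {
        private final List<Event> copy = new ArrayList<>();

        /** Incremental fetch: only correct while the JT's ordering never changes. */
        void poll(JobTrackerLog jt) {
            copy.addAll(jt.fetchFrom(copy.size()));
        }

        /** Proposed fix: on SYNC, discard the local copy and rebuild it from index 0. */
        void onSync(JobTrackerLog jt) {
            copy.clear();
            poll(jt);
        }

        /** Attempt the TT would use per map; later events override earlier ones. */
        Map<Integer, Integer> latestAttempts() {
            Map<Integer, Integer> latest = new HashMap<>();
            for (Event e : copy) {
                latest.put(e.mapId, e.attempt);
            }
            return latest;
        }
    }

    public static void main(String[] args) {
        // Before the restart: three maps, each completed by attempt 0.
        JobTrackerLog before = new JobTrackerLog().add(0, 0).add(1, 0).add(2, 0);
        // After the restart: different order, and map 1 was lost and re-executed (attempt 1).
        JobTrackerLog after = new JobTrackerLog().add(2, 0).add(1, 1).add(0, 0);

        TaskTrackerCopy stale = new TaskTrackerCopy();
        stale.poll(before);
        stale.poll(after);   // cursor is already 3, so nothing new is fetched
        System.out.println("without rebuild: " + stale.latestAttempts());

        TaskTrackerCopy rebuilt = new TaskTrackerCopy();
        rebuilt.poll(before);
        rebuilt.onSync(after);   // full rebuild picks up the re-executed map
        System.out.println("with rebuild:    " + rebuilt.latestAttempts());
    }
}
```

In this sketch the TT that keeps polling from its old position never learns about the new attempt for map 1, which corresponds to case 2 above. A TT holding only a partial copy (case 1) fares no better, since its cursor points into a list whose order has changed. Rebuilding from index 0 on _SYNC_ avoids both problems at the cost of re-fetching the whole list.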

> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>                 Key: HADOOP-3245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3245
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
> This could probably extend the work done in HADOOP-1876. This feature can be applied
> to things like jobs being able to survive jobtracker restarts.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
