hadoop-mapreduce-issues mailing list archives

From "Haibo Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6870) Add configuration for MR job to finish when all reducers are complete (even with unfinished mappers)
Date Sun, 30 Jul 2017 22:38:02 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106683#comment-16106683 ]

Haibo Chen commented on MAPREDUCE-6870:

Thanks [~pbacsko] for reminding me. A few more comments after taking a closer look.
1) The new configuration is called mapreduce.job.map.preempt-on-reduce-finish, which is a
little misleading: what we are really doing is letting the job finish, and preempting any
running mappers is just part of that. How about renaming it to something like mapreduce.job.finish-when-all-reducers-done?
Please also add a static variable in MRJobConfig to represent its default value 'true', and document
the new configuration in mapred-default.xml.
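
A minimal sketch of the key/default pair suggested in 1), assuming the proposed key name is adopted. The constant names FINISH_JOB_WHEN_REDUCERS_DONE and DEFAULT_FINISH_JOB_WHEN_REDUCERS_DONE are hypothetical, and java.util.Properties stands in for Hadoop's Configuration only so that the snippet runs on its own:

```java
import java.util.Properties;

public class FinishWhenReducersDoneConfig {
  // Key proposed in this comment; the final name may differ.
  static final String FINISH_JOB_WHEN_REDUCERS_DONE =
      "mapreduce.job.finish-when-all-reducers-done";
  // Hypothetical constant holding the default value 'true'.
  static final boolean DEFAULT_FINISH_JOB_WHEN_REDUCERS_DONE = true;

  // Stand-in for a getBoolean(key, default) lookup on the job configuration.
  static boolean finishWhenReducersDone(Properties conf) {
    String value = conf.getProperty(FINISH_JOB_WHEN_REDUCERS_DONE);
    return value == null
        ? DEFAULT_FINISH_JOB_WHEN_REDUCERS_DONE
        : Boolean.parseBoolean(value.trim());
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Unset: falls back to the default.
    System.out.println(finishWhenReducersDone(conf));
    // Explicitly disabled by the job owner.
    conf.setProperty(FINISH_JOB_WHEN_REDUCERS_DONE, "false");
    System.out.println(finishWhenReducersDone(conf));
  }
}
```

Keeping the default in a named constant means the code and the mapred-default.xml entry can be checked against a single value.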

2) In preemptMappersIfNecessary(), we can reference job.mapTasks directly and get rid of the
type check.

3) With the current flow, in which we execute preemptMappersIfNecessary() upon every task completion
event, redundant T_KILL events can be generated. Say 3 mappers are still running
when all reducers are done: we send 3 T_KILL events first. When the first of those 3 mappers
is killed, it triggers another task completion event and we send 2 more
T_KILL events, and so on. In total, we'd send 3+2+1 events. I think we want to make sure we preempt
the running mappers only once.
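
The arithmetic in 3) can be checked with a small standalone simulation. This is only a sketch under simplifying assumptions: the class, method, and flag names below are hypothetical, it does not use any JobImpl code, and it assumes each killed mapper produces exactly one further task completion event.

```java
public class KillOnceSketch {
  // Counts the T_KILL events sent while the remaining mappers are killed one by one.
  // killOnlyOnce = false models the current flow (preempt on every completion event);
  // killOnlyOnce = true models a one-time guard around the preemption.
  static int countKillEvents(int runningMappers, boolean killOnlyOnce) {
    int sent = 0;
    boolean alreadyPreempted = false;
    int running = runningMappers;
    // First iteration: the last reducer completes.
    // Each later iteration: a killed mapper reports completion.
    while (running > 0) {
      if (!(killOnlyOnce && alreadyPreempted)) {
        sent += running;          // one T_KILL per mapper still running
        alreadyPreempted = true;
      }
      running--;                  // one mapper finishes being killed
    }
    return sent;
  }

  public static void main(String[] args) {
    System.out.println(countKillEvents(3, false)); // 3 + 2 + 1 = 6
    System.out.println(countKillEvents(3, true));  // 3
  }
}
```

With the guard, the number of events equals the number of mappers running when the reducers finish, instead of growing quadratically with it.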

4) The statement in checkJobAfterTaskCompletion()
      if (job.preemptRestartedMappersOnReduceFinish) {
is hard to follow without referring to this jira. For readability, we can move it into a new
method, maybe called checkReadyForCompletionWhenAllReducersDone(), and inline whatever we have
in preemptMappersIfNecessary() there. Comments on this method can then explain what we are doing
here and why.

5) There are a few unused import statements in TestJobImpl. TestJobImpl.createSpiedTasks()
really creates spied map tasks, so we can rename it to createSpiedMapTasks(). conf.set(MRJobConfig.PREEMPT_MAPPERS_ON_REDUCE_FINISH,
Boolean.toString(killMappers)) can be replaced with conf.setBoolean(MRJobConfig.PREEMPT_MAPPERS_ON_REDUCE_FINISH, killMappers).

6) In the newly added unit test, we only verify that the mappers are killed. As noted
in 1), the goal is to finish the job, so I think we should first verify job completion when the
new configuration is set to true.

> Add configuration for MR job to finish when all reducers are complete (even with unfinished mappers)
> ----------------------------------------------------------------------------------------------------
>                 Key: MAPREDUCE-6870
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6870
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.6.1
>            Reporter: Zhe Zhang
>            Assignee: Peter Bacsko
>         Attachments: MAPREDUCE-6870-001.patch, MAPREDUCE-6870-002.patch
> Even with MAPREDUCE-5817, there could still be cases where mappers get scheduled before
> all reducers are complete, but those mappers run for a long time, even after all reducers are
> complete. This could hurt the performance of large MR jobs.
> In some cases, mappers don't have any materialize-able outcome other than providing intermediate
> data to reducers. In that case, the job owner should have the config option to finish the
> job once all reducers are complete.

This message was sent by Atlassian JIRA

