hadoop-mapreduce-issues mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1434) Dynamic add input for one job
Date Tue, 09 Feb 2010 19:05:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831593#action_12831593
] 

Owen O'Malley commented on MAPREDUCE-1434:
------------------------------------------

+1

It also helps a more interesting use case: a pipeline of MapReduce jobs where you don't
want the second set of maps to wait until the last reduce finishes. It would be great if job
control could use this as an optimization.

You need to have a method where the application declares that all of the input has been added.
To avoid having reduces holding slots that they can't use, I'd suggest that no reduces should
be launched until the input is complete.

A timeout is also required, so that if a user disappears the job is killed after N minutes
with no new input and without the input having been declared complete.
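
The three rules above (an explicit "input complete" call, no reduces before it, and an
idle timeout) can be sketched as a small state holder. This is only an illustration under
assumptions: DynamicInputGate and all of its methods are hypothetical names and do not
exist in Hadoop, and time is passed in explicitly to keep the sketch self-contained.

```java
// Hypothetical sketch; none of these names are part of Hadoop.
final class DynamicInputGate {
    private final long timeoutMillis;   // the "N minutes" idle limit
    private long lastActivityMillis;    // time of job start or of the last added input
    private boolean inputComplete;      // set once the application declares input done
    private int splitsAdded;

    DynamicInputGate(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = nowMillis;
    }

    // Record newly added input; adding after completion is rejected.
    synchronized void addInput(int newSplits, long nowMillis) {
        if (inputComplete) {
            throw new IllegalStateException("input already declared complete");
        }
        if (newSplits <= 0) {
            throw new IllegalArgumentException("newSplits must be positive");
        }
        splitsAdded += newSplits;
        lastActivityMillis = nowMillis;
    }

    // The application declares that all of the input has been added.
    synchronized void markInputComplete() {
        inputComplete = true;
    }

    // Reduces stay unscheduled until the input is complete,
    // so they never hold slots they cannot use.
    synchronized boolean canLaunchReduces() {
        return inputComplete;
    }

    // Kill the job if no input arrived for timeoutMillis and
    // the input was never declared complete.
    synchronized boolean shouldKill(long nowMillis) {
        return !inputComplete && nowMillis - lastActivityMillis >= timeoutMillis;
    }

    synchronized int totalSplits() {
        return splitsAdded;
    }
}
```

A scheduler in this sketch would poll shouldKill on each heartbeat and consult
canLaunchReduces before assigning any reduce task.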

> Dynamic add input for one job
> -----------------------------
>
>                 Key: MAPREDUCE-1434
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1434
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>         Environment: 0.19.0
>            Reporter: Xing Shi
>
> Currently we must first upload the data to HDFS before we can analyze it with
> Hadoop MapReduce.
> Sometimes the upload takes a long time, so if input could be added while a job is
> running, that time could be saved.
> WHAT?
> Client:
> a) hadoop job -add-input jobId inputFormat ...
> Add the input to the job jobId.
> b) hadoop job -add-input done
> Tell the JobTracker that all of the input has been added.
> c) hadoop job -add-input status jobid
> Show how many inputs the job has.
> HOWTO?
> Mainly, I think we should do three things:
> 1. JobClient: JobClient should support adding input to a job; that is, JobClient generates
> the splits and submits them to the JobTracker.
> 2. JobTracker: JobTracker should support addInput and append the new tasks to the original
> mapTasks. Because the uploaded data will be processed quickly, the scheduler should also be
> updated to support holding a map task pending until the client declares the job's input done.
> 3. Reducer: the reducer should also update its map count (mapNums) so that it shuffles correctly.
> This is the rough idea, and I will update it.
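
Item 3 in the description is the subtle one: a reducer normally knows the number of maps
up front, but here that number grows while the job runs. A minimal sketch of the
bookkeeping a reducer might need follows. ShuffleProgress and its methods are hypothetical
names used only for illustration and are not Hadoop APIs.

```java
// Hypothetical sketch; none of these names are part of Hadoop.
final class ShuffleProgress {
    private int knownMaps;          // map tasks announced so far (mapNums in the description)
    private int fetchedOutputs;     // map outputs this reducer has already copied
    private boolean mapCountFinal;  // true once the client has declared the input done

    // Called when added input has produced more map tasks.
    synchronized void mapsAdded(int count) {
        if (mapCountFinal) {
            throw new IllegalStateException("map count is already final");
        }
        if (count < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        knownMaps += count;
    }

    // Called when the input has been declared done; no more maps will appear.
    synchronized void mapCountIsFinal() {
        mapCountFinal = true;
    }

    // Called after one map output has been copied successfully.
    synchronized void outputFetched() {
        if (fetchedOutputs >= knownMaps) {
            throw new IllegalStateException("more outputs fetched than maps known");
        }
        fetchedOutputs++;
    }

    // The shuffle may end only when the map count is final AND every output
    // is fetched; otherwise a reducer that has caught up with the maps known
    // so far would finish too early and miss later input.
    synchronized boolean isShuffleComplete() {
        return mapCountFinal && fetchedOutputs == knownMaps;
    }
}
```

The point of the sketch is the second condition in isShuffleComplete: having fetched
every currently known map output is not enough while more input can still be added.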

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

