hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1434) Dynamic add input for one job
Date Thu, 11 Feb 2010 18:12:36 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832608#action_12832608 ]

Aaron Kimball commented on MAPREDUCE-1434:
------------------------------------------

Owen,

The {{getNewInputSplits}} method proposed above requires the InputFormat to maintain state
containing the previously-enumerated InputSplits. The proposed command-line tools imply
independent user-side processes adding files to the job, which makes keeping that state challenging.
And given that splits are calculated on the client while the "true" list of input splits is held
by the JobTracker (or is/could the splits file be written to HDFS?), calculating just the
delta is difficult.

I think it might be more reasonable if one of the following things were true:
* The client code just calls {{getInputSplits()}} again. The same algorithm runs as in the "initial"
job submission, but the output list may be longer than the list previously returned by this
method. The InputFormat is responsible for ensuring that it never returns fewer splits
than it did before (i.e., it doesn't drop inputs).
* For that matter, if the input queue for a job is dynamic, I don't see why this same mechanism
couldn't be used to drop splits that are, for whatever reason, irrelevant.
* {{getNewInputSplits()}} should have the signature: {{InputSplit[] getNewInputSplits(JobContext job, List<InputSplit> existingSplits) throws IOException, InterruptedException}}.

The latter case would present to the user a list of the existing inputs read from the job's
existing 'splits' file. That way state-tracking is unnecessary; the InputFormat can just use
(e.g.) a PathFilter to disregard anything already in {{existingSplits}}. A sketch follows below.
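
To make this concrete, here's a minimal sketch of that variant. Everything here is hypothetical -- {{getNewInputSplits}} is only the proposed API, and {{AppendableTextInputFormat}} is an invented class. For simplicity it filters splits after enumeration rather than installing a PathFilter, but the effect is the same:

{code:java}
// Hypothetical sketch: getNewInputSplits() is the proposed API, not an
// existing Hadoop method, and this class is invented for illustration.
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class AppendableTextInputFormat extends TextInputFormat {

  // Proposed signature: stateless, because the already-submitted splits
  // are read from the job's splits file and passed in by the caller.
  public InputSplit[] getNewInputSplits(JobContext job,
      List<InputSplit> existingSplits)
      throws IOException, InterruptedException {
    // Record the paths already covered by the job's existing splits.
    Set<Path> seen = new HashSet<Path>();
    for (InputSplit split : existingSplits) {
      if (split instanceof FileSplit) {
        seen.add(((FileSplit) split).getPath());
      }
    }
    // Re-run the normal enumeration and keep only splits over new files.
    List<InputSplit> delta = new ArrayList<InputSplit>();
    for (InputSplit split : getSplits(job)) {
      if (!(split instanceof FileSplit)
          || !seen.contains(((FileSplit) split).getPath())) {
        delta.add(split);
      }
    }
    return delta.toArray(new InputSplit[delta.size()]);
  }
}
{code}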

A final proposition is that users must manually specify new paths (or other arbitrary arguments
like database table names, URLs, etc.) to include, in addition to the InputFormat. In that
case, it might be saner to have:
* {{getNewInputSplits()}} should have the signature: {{InputSplit[] getNewInputSplits(JobContext job, String... newSplitHints) throws IOException, InterruptedException}}.

{{newSplitHints}} is effectively a user-specified argv; it can be decoded as a list of
Paths, database tables, etc., and used by the InputFormat as appropriate to generate new splits. A sketch of this variant follows.
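
Again a hypothetical sketch, this time decoding the hint strings as Paths (a database-backed InputFormat would decode them as table names instead). {{HintedTextInputFormat}} is an invented class, and for brevity it produces one whole-file split per path rather than splitting on block boundaries:

{code:java}
// Hypothetical sketch: decodes the argv-style hints as file paths and
// enumerates splits for just those files.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HintedTextInputFormat extends TextInputFormat {

  // Proposed signature: newSplitHints is a user-specified argv that each
  // InputFormat is free to interpret (paths here; tables or URLs elsewhere).
  public InputSplit[] getNewInputSplits(JobContext job,
      String... newSplitHints) throws IOException, InterruptedException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (String hint : newSplitHints) {
      Path path = new Path(hint);
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      FileStatus status = fs.getFileStatus(path);
      // One whole-file split per hinted path; a real implementation would
      // split on block boundaries the way FileInputFormat.getSplits() does.
      splits.add(new FileSplit(path, 0L, status.getLen(), null));
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }
}
{code}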

Another question: what are the semantics of a doubly-specified split? (I'm especially curious about
the inexact-match case, where the same file in HDFS is enumerated twice but the splits are
at different offsets.) Can/should the same file be processed twice in a job?

Finally: why does a user-disconnect timeout kill the job? That's different from the usual
case in MapReduce, where a user disconnect is not noticed by the server-side processes at
all. I would think that a user-disconnect timeout should declare that all the input has been
added and that the reduce phase can begin -- not that it should kill things.

> Dynamic add input for one job
> -----------------------------
>
>                 Key: MAPREDUCE-1434
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1434
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>         Environment: 0.19.0
>            Reporter: Xing Shi
>
> Usually we must first upload the data to HDFS before we can analyze it with Hadoop MapReduce.
> Sometimes the upload takes a long time, so if we could add input while a job is running, that time could be saved.
> WHAT?
> Client:
> a) hadoop job -add-input jobId inputFormat ...
> Add the given input to jobId.
> b) hadoop job -add-input done
> Tell the JobTracker that the input has been fully prepared.
> c) hadoop job -add-input status jobid
> Show how many inputs the jobid has.
> HOWTO?
> Mainly, I think we should do three things:
> 1. JobClient: JobClient should support adding input to a job; as before, it generates the splits and submits them to the JobTracker.
> 2. JobTracker: the JobTracker should support addInput and append the new tasks to the original map tasks. Because the uploaded data will be processed quickly, the scheduler should also be updated to support keeping a map task pending until the client tells the job that its input is done.
> 3. Reducer: the reducer should also update the number of map tasks so that the shuffle works correctly.
> This is the rough idea, and I will update it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

