hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-619) Unify Map-Reduce and Streaming to take the same globbed input specification
Date Mon, 11 Dec 2006 18:32:28 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-619?page=comments#action_12457431 ] 
Doug Cutting commented on HADOOP-619:

> change the signature of InputFormatBase.areValidInputDirectories() to return a list of valid Paths

I don't think we should do this, since it assumes that inputs are always representable by
Paths.  The long-term goal is that inputs and outputs are only assumed to be (1) specifiable
in a Configuration and (2) representable as a serializable Split implementation.  In particular,
we should not assume that they are FileSystem files, and hence should not pass Path in a primitive
MapReduce API.  Thus areValidInputDirectories(Path[]) should evolve into hasValidInput(JobConf),
but not as a part of this patch.
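
To make the intended direction concrete, here is a minimal sketch of an input contract that never mentions Path. It is not Hadoop code: Conf is a stand-in for JobConf, and the Input, Split, RangeInput and RangeSplit types and the "range.start" / "range.end" keys are hypothetical names chosen for illustration. Only hasValidInput(...) takes its name from the comment above. The input here is a range of integers, so nothing in it is a FileSystem file.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ConfigDrivenInputSketch {

    /** Stand-in for JobConf (assumption): plain string keys and values. */
    static final class Conf {
        private final Map<String, String> values = new HashMap<>();
        void set(String key, String value) { values.put(key, value); }
        String get(String key) { return values.get(key); }
    }

    /** A unit of work; the only thing assumed is that it can be serialized. */
    interface Split extends Serializable {}

    /** Hypothetical input contract: no Path appears in any signature. */
    interface Input {
        boolean hasValidInput(Conf conf);
        List<Split> getSplits(Conf conf, int numSplits);
    }

    /** A non-file split: the integers in [start, end). */
    static final class RangeSplit implements Split {
        private static final long serialVersionUID = 1L;
        final long start;
        final long end;
        RangeSplit(long start, long end) { this.start = start; this.end = end; }
    }

    /** Input that is described entirely by two configuration keys. */
    static final class RangeInput implements Input {
        public boolean hasValidInput(Conf conf) {
            try {
                // parseLong(null) also throws NumberFormatException,
                // so missing keys are reported as invalid input.
                return Long.parseLong(conf.get("range.start"))
                        < Long.parseLong(conf.get("range.end"));
            } catch (NumberFormatException e) {
                return false;
            }
        }

        public List<Split> getSplits(Conf conf, int numSplits) {
            long start = Long.parseLong(conf.get("range.start"));
            long end = Long.parseLong(conf.get("range.end"));
            int n = Math.max(1, numSplits);
            // Ceiling division so that at most n splits are produced.
            long chunk = Math.max(1, (end - start + n - 1) / n);
            List<Split> splits = new ArrayList<>();
            for (long s = start; s < end; s += chunk) {
                splits.add(new RangeSplit(s, Math.min(end, s + chunk)));
            }
            return splits;
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("range.start", "0");
        conf.set("range.end", "10");
        Input input = new RangeInput();
        System.out.println("valid: " + input.hasValidInput(conf));
        System.out.println("splits: " + input.getSplits(conf, 3).size());
    }
}
```

A file-based input would implement the same two methods and keep its Paths as an internal detail.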

> Otherwise we will need to expand the globs twice

For now, I think the glob can happen twice: once when checking the input in the JobClient,
and once on the JobTracker when constructing splits.  Longer term we may wish to try to optimize
this, but I'm not convinced that's required.  If it is required, then perhaps, instead of
hasValidInput(JobConf) we could have a prepareInput(JobConf) method that validates inputs
and is permitted to alter the job.  But we shouldn't modify the InputFormat API for this issue.
Owen's already working on that separately.
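
One possible reading of "permitted to alter the job" is sketched below: the validation step expands the glob once and records the result in the configuration, so the later split-construction step can reuse it. This is an assumption made for illustration, not a description of any Hadoop API. The configuration is a plain Map, the keys "input.pattern" and "input.expanded" are invented, and the glob is matched against an in-memory list of names with java.nio's PathMatcher rather than against a real FileSystem.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrepareInputSketch {
    static final String PATTERN_KEY = "input.pattern";   // hypothetical key
    static final String EXPANDED_KEY = "input.expanded"; // hypothetical key

    /**
     * Validates the input and, as a side effect, stores the expanded glob
     * in the configuration. Returns false if nothing matches.
     */
    static boolean prepareInput(Map<String, String> conf, List<String> available) {
        String pattern = conf.get(PATTERN_KEY);
        if (pattern == null) {
            return false;
        }
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> matched = new ArrayList<>();
        for (String name : available) {
            if (matcher.matches(Paths.get(name))) {
                matched.add(name);
            }
        }
        if (matched.isEmpty()) {
            return false;
        }
        // This is the "alter the job" step: cache the expansion.
        conf.put(EXPANDED_KEY, String.join(",", matched));
        return true;
    }

    /** Later stage: reads the cached expansion instead of globbing again. */
    static List<String> inputsForSplits(Map<String, String> conf) {
        return Arrays.asList(conf.get(EXPANDED_KEY).split(","));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PATTERN_KEY, "data/part-*");
        List<String> available =
                Arrays.asList("data/part-00000", "data/part-00001", "data/_logs");
        if (prepareInput(conf, available)) {
            System.out.println(inputsForSplits(conf));
        }
    }
}
```

The trade-off is that the cached list can go stale if the input changes between validation and split construction, which is one reason expanding twice may be acceptable for now.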

> Unify Map-Reduce and Streaming to take the same globbed input specification
> ---------------------------------------------------------------------------
>                 Key: HADOOP-619
>                 URL: http://issues.apache.org/jira/browse/HADOOP-619
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: eric baldeschwieler
>         Assigned To: Sanjay Dahiya
>         Attachments: Hadoop-619.patch, Hadoop-619.patch, Hadoop-619.patch
> Right now streaming input is specified very differently from other map-reduce input.  It would be good if these two apps could take much more similar input specs.
> In particular -input in streaming expects a file or glob pattern while MR takes a directory.  It would be cool if both could take a glob pattern of files and if both took a directory by default (with some pattern excluded to allow logs, metadata and other framework output to be safely stored).
> We want to be sure that MR input is backward compatible over this change.  I propose that a single file or a single directory should be accepted as an input.  Globs should only match directories if the pattern is '/' terminated, to avoid massive inputs specified by mistake.
> Thoughts?
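
The matching rule proposed in the issue description can be sketched as follows. This is one interpretation, not the patch attached to the issue: a glob with a trailing '/' is taken to match directories only, a glob without one matches files only, and a plain path (no glob characters) is accepted whether it names a file or a directory. The Entry type and the in-memory listing are hypothetical stand-ins for a real file system.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GlobDirectoryRuleSketch {

    /** Hypothetical directory-listing entry. */
    static final class Entry {
        final String path;
        final boolean isDirectory;
        Entry(String path, boolean isDirectory) {
            this.path = path;
            this.isDirectory = isDirectory;
        }
    }

    static List<String> expand(String pattern, List<Entry> entries) {
        boolean slashTerminated = pattern.endsWith("/");
        String body = slashTerminated
                ? pattern.substring(0, pattern.length() - 1)
                : pattern;
        // Treat the pattern as a glob only if it has glob metacharacters.
        boolean isGlob = body.matches(".*[*?\\[{].*");
        List<String> result = new ArrayList<>();
        if (!isGlob) {
            // A single file or a single directory is accepted as-is.
            for (Entry e : entries) {
                if (e.path.equals(body)) {
                    result.add(e.path);
                }
            }
            return result;
        }
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + body);
        for (Entry e : entries) {
            // Directories match only when the pattern ends in '/'.
            if (e.isDirectory == slashTerminated
                    && matcher.matches(Paths.get(e.path))) {
                result.add(e.path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> listing = Arrays.asList(
                new Entry("logs/2006-12-10", true),
                new Entry("logs/2006-12-11", true),
                new Entry("logs/README", false));
        System.out.println(expand("logs/*", listing));
        System.out.println(expand("logs/*/", listing));
        System.out.println(expand("logs/2006-12-10", listing));
    }
}
```

Under this reading, a pattern such as logs/* cannot pull in whole directories by accident; the caller has to ask for them with logs/*/.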

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

