hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-619) Unify Map-Reduce and Streaming to take the same globbed input specification
Date Fri, 08 Dec 2006 18:29:25 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-619?page=comments#action_12456932 ] 
Doug Cutting commented on HADOOP-619:

I should have followed this patch more closely earlier.  Sorry!

This logic should not be in JobClient, but rather in InputFormatBase.  The kernel should know
very little about how inputs are specified; it should delegate all of that to the InputFormat.
 Inputs will usually be files in a Hadoop FileSystem, but they could be something else entirely,
e.g., something not easily modelled by Hadoop's FileSystem API.  So in mapred kernel code,
like JobClient, we shouldn't assume that inputs are from a FileSystem.
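
A minimal sketch of the delegation described above, not Hadoop's actual API: every type here (SketchInputFormat, FileListInputFormat, RangeInputFormat, SketchJobClient) and every config key is hypothetical, and a plain Map stands in for JobConf.  The point is only that the kernel-side code never looks at paths itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an input format; not Hadoop's real interface.
interface SketchInputFormat {
    // Returns opaque descriptions of the job's inputs, derived from the config.
    List<String> describeInputs(Map<String, String> conf);
}

// File-style inputs: reads a comma-separated path list from an assumed key.
class FileListInputFormat implements SketchInputFormat {
    @Override
    public List<String> describeInputs(Map<String, String> conf) {
        List<String> inputs = new ArrayList<>();
        for (String p : conf.getOrDefault("sketch.input.paths", "").split(",")) {
            if (!p.isEmpty()) {
                inputs.add("file:" + p);
            }
        }
        return inputs;
    }
}

// Non-file inputs: a count of synthetic ranges, with no FileSystem involved.
class RangeInputFormat implements SketchInputFormat {
    @Override
    public List<String> describeInputs(Map<String, String> conf) {
        int count = Integer.parseInt(conf.getOrDefault("sketch.range.count", "0"));
        List<String> inputs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            inputs.add("range:" + i);
        }
        return inputs;
    }
}

// "Kernel" code: it only talks to the interface and never parses paths.
class SketchJobClient {
    static int countInputs(SketchInputFormat format, Map<String, String> conf) {
        return format.describeInputs(conf).size();
    }
}
```

Because SketchJobClient sees only the interface, a format whose inputs are not files needs no change on the kernel side.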

I have not been as good at enforcing this distinction in the past.  For example, JobConf.addInputPath(Path)
should be a static InputFormatBase.addInputPath(JobConf,Path).  The method is provided on
JobConf as a convenience, for jobs that specify their inputs with FileSystem paths, so this
isn't a gross violation: applications are not forced to use this method.
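
As a rough illustration of the static-helper shape suggested above, the sketch below keeps knowledge of the path-list key inside the file-based format rather than on the job configuration.  SketchFileInputFormat and the key name are assumptions made for this example, a plain Map stands in for JobConf, and a String stands in for Path.

```java
import java.util.Map;

// Hypothetical file-based input format that owns its own config key.
class SketchFileInputFormat {
    // Assumed key name, used only for illustration.
    private static final String INPUT_PATHS_KEY = "sketch.input.paths";

    // Appends a path to the comma-separated list stored in the config.
    static void addInputPath(Map<String, String> conf, String path) {
        String existing = conf.get(INPUT_PATHS_KEY);
        conf.put(INPUT_PATHS_KEY, existing == null ? path : existing + "," + path);
    }

    // Reads the list back; returns an empty array if nothing was added.
    static String[] getInputPaths(Map<String, String> conf) {
        String value = conf.get(INPUT_PATHS_KEY);
        return value == null ? new String[0] : value.split(",");
    }
}
```

With this shape the configuration object stays a generic key/value store, and only jobs that use file inputs ever touch the helper.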

More seriously, the InputFormat method areValidInputDirectories(Path[]) should be more abstract:
we must not assume that inputs are always named with FileSystem Paths.  Rather, this method
should probably be something like hasValidInput(JobConf).  Then InputFormatBase can implement
it to glob, etc.  To make this even more clear, we should rename InputFormatBase to be FileSystemInputFormat.
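
One way the more abstract check might look, sketched under assumptions: the interface and class names below are hypothetical, a Map stands in for JobConf, and java.nio's local file system stands in for Hadoop's FileSystem.  Only the hasValidInput name comes from the suggestion above.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

// Hypothetical interface: validity is judged from the job config alone.
interface ValidatingInputFormat {
    boolean hasValidInput(Map<String, String> conf);
}

// Path-backed format: valid only if every configured path exists locally.
class LocalPathInputFormat implements ValidatingInputFormat {
    static final String PATHS_KEY = "sketch.input.paths"; // assumed key

    @Override
    public boolean hasValidInput(Map<String, String> conf) {
        String paths = conf.get(PATHS_KEY);
        if (paths == null || paths.isEmpty()) {
            return false;
        }
        for (String p : paths.split(",")) {
            if (!Files.exists(Paths.get(p))) {
                return false;
            }
        }
        return true;
    }
}

// Non-path format: valid if a non-blank query string is configured.
class QueryInputFormat implements ValidatingInputFormat {
    static final String QUERY_KEY = "sketch.input.query"; // assumed key

    @Override
    public boolean hasValidInput(Map<String, String> conf) {
        String query = conf.get(QUERY_KEY);
        return query != null && !query.trim().isEmpty();
    }
}
```

The caller asks the same question of both formats without knowing whether paths are involved; a file-based implementation would be the natural place for globbing.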

But those changes are beyond the scope of this issue and should be made under separate issues.
 For this issue we need to move as much of the logic as possible out of JobClient and into

> Unify Map-Reduce and Streaming to take the same globbed input specification
> ---------------------------------------------------------------------------
>                 Key: HADOOP-619
>                 URL: http://issues.apache.org/jira/browse/HADOOP-619
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: eric baldeschwieler
>         Assigned To: Sanjay Dahiya
>         Attachments: Hadoop-619.patch, Hadoop-619.patch, Hadoop-619.patch
> Right now streaming input is specified very differently from other map-reduce input.
 It would be good if these two apps could take much more similar input specs.
> In particular -input in streaming expects a file or glob pattern while MR takes a directory.
 It would be cool if both could take a glob pattern of files and if both took a directory by
default (with some pattern excluded to allow logs, metadata and other framework output to be
safely stored).
> We want to be sure that MR input is backward compatible over this change.  I propose
that a single file should be accepted as an input or a single directory.  Globs should only
match directories if the pattern is '/' terminated, to avoid massive inputs specified by mistake.
> Thoughts?
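
A small sketch of one possible reading of the '/'-terminated rule proposed in the issue description: a pattern ending in '/' selects only directories, and any other pattern selects only files.  This is an illustration, not the patch's behaviour.  GlobRuleSketch is a hypothetical name, the candidate entries are passed in as a map from path to an is-directory flag instead of being listed from a file system, and the separate proposal that a plain directory name be accepted on its own is not covered.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class GlobRuleSketch {
    // entries maps each candidate path to true if it is a directory.
    static List<String> select(String pattern, Map<String, Boolean> entries) {
        // A trailing '/' opts in to matching directories.
        boolean wantDirs = pattern.endsWith("/");
        String glob = wantDirs ? pattern.substring(0, pattern.length() - 1) : pattern;
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);

        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : entries.entrySet()) {
            boolean isDir = entry.getValue();
            // Keep an entry only if its kind matches what the pattern asked for.
            if (isDir == wantDirs && matcher.matches(Paths.get(entry.getKey()))) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}
```

Under this reading a pattern such as in/* cannot pull in a whole directory tree by accident, which is the failure the proposal is trying to avoid.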

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

