hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-619) Unify Map-Reduce and Streaming to take the same globbed input specification
Date Mon, 04 Dec 2006 23:21:23 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-619?page=comments#action_12455448 ] 
Runping Qi commented on HADOOP-619:

A couple of points:

1. Specifying input:
I think the user should specify the input files in the following way:

spec_1,spec_2, ...

Each spec above is a "root" directory, with an optional pattern specification 
(regex, or simple wildcard-based patterns). 
The semantics of a spec is that all the files directly or indirectly under the root are input
as long as their paths match the pattern. 
If the pattern is omitted from the spec, all the files under the root dir are
input files.

Here are a couple of examples:

"foo/": all the files under the foo tree

"foo/;*.gz": all the files under the foo tree with extension .gz

"foo/; */bar/*.gz": all the files under the foo tree with extension .gz and with bar as an intermediate directory
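
The spec semantics above could be sketched roughly as follows. This is only an illustration in Python, not Hadoop code: the function names parse_specs and select_inputs are hypothetical, the pattern is assumed to be a simple wildcard matched against the path relative to the root (with "*" allowed to span directory separators), and the candidate file list is passed in rather than read from a file system.

```python
import fnmatch


def parse_specs(spec_string):
    """Split 'spec_1,spec_2,...' into (root, pattern) pairs.

    The pattern is None when the spec has no ';pattern' part.
    Assumption: neither roots nor patterns contain ',' or ';'.
    """
    specs = []
    for raw in spec_string.split(","):
        raw = raw.strip()
        if not raw:
            continue
        root, _sep, pattern = raw.partition(";")
        specs.append((root.strip(), pattern.strip() or None))
    return specs


def select_inputs(specs, all_files):
    """Return the files that sit under some root and match that spec's pattern.

    Assumption: the pattern is matched against the path relative to the
    root, using shell-style wildcards where '*' may also match '/'.
    """
    selected = []
    for path in all_files:
        for root, pattern in specs:
            prefix = root.rstrip("/") + "/"
            if not path.startswith(prefix):
                continue
            relative = path[len(prefix):]
            if pattern is None or fnmatch.fnmatchcase(relative, pattern):
                selected.append(path)
                break  # avoid listing a file twice when several specs match
    return selected
```

With the file list ["foo/a.txt", "foo/x/b.gz", "foo/x/bar/c.gz", "other/d.gz"], the spec "foo/;*.gz" would select the two .gz files under foo, and "foo/; */bar/*.gz" would select only foo/x/bar/c.gz. Whether the pattern should apply to the relative path or to the full path is left open by the proposal; the sketch picks the former.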

2. Checking and matching:
    The job client should check the existence of the root dirs. 
    If none of the root dirs exists, then the job should fail immediately.
    If some root dirs do not exist, the job client should generate a warning.

    The InputFormatBase should perform the file list generation and matching. The list of
    matched files should be part of the job's status so that the user can examine it through
    the web UI.
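
The root-directory check described in point 2 might look something like the sketch below. It is an illustration only: check_roots is a hypothetical name, and the existence test is passed in as a callable so that no particular file system API is assumed.

```python
def check_roots(roots, exists):
    """Apply the proposed rule to a list of root dirs.

    `exists` is any callable that reports whether a root dir exists
    (an assumption standing in for a real file system lookup).
    Returns (existing_roots, warnings); raises if no root exists.
    """
    existing = [r for r in roots if exists(r)]
    missing = [r for r in roots if not exists(r)]
    if not existing:
        # None of the roots exists: fail the job immediately.
        raise FileNotFoundError(
            "none of the input roots exist: " + ", ".join(roots)
        )
    # Some roots are missing: keep going, but report each one.
    warnings = ["input root does not exist: " + r for r in missing]
    return existing, warnings
```

For a local experiment, os.path.isdir could be passed as the `exists` argument; the returned warnings would then be printed by the caller before the file list is generated.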

> Unify Map-Reduce and Streaming to take the same globbed input specification
> ---------------------------------------------------------------------------
>                 Key: HADOOP-619
>                 URL: http://issues.apache.org/jira/browse/HADOOP-619
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: eric baldeschwieler
>         Assigned To: Sanjay Dahiya
> Right now streaming input is specified very differently from other map-reduce input.
> It would be good if these two apps could take much more similar input specs.
> In particular, -input in streaming expects a file or glob pattern, while MR takes a directory.
> It would be cool if both could take a glob pattern of files and if both took a directory by
> default (with some pattern excluded to allow logs, metadata and other framework output to be
> safely stored).
> We want to be sure that MR input is backward compatible over this change.  I propose
> that a single file or a single directory should be accepted as an input.  Globs should only
> match directories if the pattern is '/' terminated, to avoid massive inputs specified by mistake.
> Thoughts?

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

