From: Arkady Borkovsky
Subject: Re: [jira] Commented: (HADOOP-619) Unify Map-Reduce and Streaming to take the same globbed input specification
Date: Tue, 5 Dec 2006 00:07:51 -0800
To: hadoop-dev@lucene.apache.org
In-Reply-To: <16595805.1165304482949.JavaMail.jira@brutus>

I
have a directory with per-week subdirectories. Currently, I have to
specify a separate -input for each subdirectory, with a glob pattern
(the data start in 2004 and are supposed to be added weekly, ongoing).

It is possible to flatten the structure, but is there a big conceptual
or implementation problem with general globbing?

On Dec 4, 2006, at 11:41 PM, eric baldeschwieler (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/HADOOP-619?page=comments#action_12455532 ]
>
> eric baldeschwieler commented on HADOOP-619:
> --------------------------------------------
>
> Perhaps we should just limit to either globbing or a single directory
> per argument and simply drop directories from globbing? This seems
> fairly simple and not too restrictive in practice.
>
> I agree that if a directory is used we should exclude files starting
> with "_". This will allow us to put metadata in output directories.
> I think we should also simply exclude subdirectories in input
> directories. Again, I doubt this will prove restrictive in practice.
>
> It seems to me we should error out if any glob matches no files or a
> listed input directory is not present. Perhaps we could provide
> another switch for an optional input in case users actually want a job
> to run if an input spec doesn't match any input.
>
>> Unify Map-Reduce and Streaming to take the same globbed input specification
>> ---------------------------------------------------------------------------
>>
>>                 Key: HADOOP-619
>>                 URL: http://issues.apache.org/jira/browse/HADOOP-619
>>             Project: Hadoop
>>          Issue Type: Improvement
>>          Components: mapred
>>            Reporter: eric baldeschwieler
>>         Assigned To: Sanjay Dahiya
>>
>> Right now streaming input is specified very differently from other
>> map-reduce input. It would be good if these two apps could take much
>> more similar input specs.
>> In particular -input in streaming expects a file or glob pattern
>> while MR takes a directory.
>> It would be cool if both could take a glob pattern of files and if
>> both took a directory by default (with some pattern excluded to allow
>> logs, metadata and other framework output to be safely stored).
>> We want to be sure that MR input is backward compatible over this
>> change. I propose that a single file should be accepted as an input
>> or a single directory. Globs should only match directories if the
>> pattern is '/' terminated, to avoid massive inputs specified by
>> mistake.
>> Thoughts?
>
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
> http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
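
For illustration only, the rules proposed in the quoted comment (an argument is either a glob or a single directory, globs drop directories, a directory input skips "_"-prefixed files and subdirectories, and an input that matches nothing is an error unless marked optional) could be sketched roughly as below. This is a minimal sketch against the local filesystem using Python's standard library, not Hadoop code; the function name resolve_input and the optional flag are hypothetical and do not come from the thread or from any Hadoop API.

```python
import glob
import os


def resolve_input(spec, optional=False):
    """Expand one input spec into a sorted list of regular files.

    Hypothetical helper: 'spec' is either a single directory or a glob.
    """
    if os.path.isdir(spec):
        # A plain directory: take its regular files, skipping files whose
        # names start with "_" (metadata) and skipping subdirectories.
        matches = [
            os.path.join(spec, name)
            for name in os.listdir(spec)
            if not name.startswith("_")
            and os.path.isfile(os.path.join(spec, name))
        ]
    else:
        # Anything else is treated as a glob; matched directories are dropped.
        matches = [p for p in glob.glob(spec) if os.path.isfile(p)]

    # A spec that yields nothing (including a missing directory) is an
    # error unless the caller explicitly marked the input as optional.
    if not matches and not optional:
        raise FileNotFoundError("input spec matched no files: " + spec)
    return sorted(matches)
```

Under these assumptions, a per-week layout like the one described at the top of this message could be covered by one spec such as resolve_input("data/*/part-*") (the path is made up), because the glob reaches into each weekly subdirectory and only the matched files are kept.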