From: Aaron Kimball
Date: Tue, 2 Jun 2009 16:22:21 -0700
Subject: Re: Subdirectory question revisited
To: core-user@hadoop.apache.org
Reply-To: core-user@hadoop.apache.org
Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
In-Reply-To: <4A258A16.8050300@darose.net>

There is no technical limit that prevents Hadoop from
operating in this fashion; it's simply that the included InputFormat
implementations do not. This behavior has been in place for a long time,
so it's unlikely to change soon, as that might break existing
applications. But you can write your own subclass of TextInputFormat or
SequenceFileInputFormat that overrides the getSplits() method to
recursively descend through directories and search for files.

- Aaron

On Tue, Jun 2, 2009 at 1:22 PM, David Rosenstrauch wrote:

> As per a previous list question (
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3Ce75c02ef0804011433x144813e6x2450da7883de3aca@mail.gmail.com%3E)
> it looks as though it's not possible for Hadoop to traverse input
> directories recursively in order to discover input files.
>
> Just wondering a) if there's any particular reason why this
> functionality doesn't exist, and b) if not, whether there's any
> workaround/hack to make it possible.
>
> Like the OP, I was thinking it would be helpful to partition my input
> data by year, month, and day. I figured this would enable me to run
> jobs against specific date ranges of input data, and thereby speed up
> the execution of my jobs, since they wouldn't have to process every
> single record.
>
> Any way to make this happen? (Or am I totally going about this the
> wrong way for what I'm trying to achieve?)
>
> TIA,
>
> DR
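[A sketch of the recursive descent Aaron describes. To keep it self-contained and runnable it walks the local filesystem with java.nio.file rather than Hadoop's FileSystem API; in an actual InputFormat subclass you would perform the same walk with FileSystem.listStatus() inside your overridden getSplits(). The class name RecursiveLister is made up for illustration.]

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrates the directory walk an InputFormat subclass would do:
// instead of taking only the files directly under each input path,
// descend into subdirectories and collect every file found.
public class RecursiveLister {

    // Collect every regular file under root, recursing into directories.
    public static List<Path> listRecursively(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
            for (Path entry : stream) {
                if (Files.isDirectory(entry)) {
                    files.addAll(listRecursively(entry)); // descend
                } else {
                    files.add(entry);
                }
            }
        }
        return files;
    }
}
```

With input laid out as year/month/day subdirectories, the same idea lets the job pick up files at any depth, where the stock getSplits() would only see the top level.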