hadoop-common-user mailing list archives

From David_ca <davidsupe...@gmail.com>
Subject Re: Subdirectory question revisited
Date Fri, 31 Jul 2009 17:07:44 GMT
The way I solved this problem for myself is that I created a file in which each
line is the path of a log file on S3.

I then filter on the log filename, which contains the date, against a date range.
The lines that satisfy the date range are used as the input for the job.
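The filtering step above can be sketched as follows. This is a minimal stand-alone version; the bucket name, path layout, and yyyy-MM-dd filename pattern are hypothetical stand-ins for whatever the real logs use:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class LogPathFilter {
    // Keep only paths whose <date>.log filename falls inside [start, end].
    // Assumes filenames look like s3://bucket/logs/<hostname>/yyyy-MM-dd.log.
    static List<String> filterByDate(List<String> paths, LocalDate start, LocalDate end) {
        List<String> keep = new ArrayList<>();
        for (String p : paths) {
            String name = p.substring(p.lastIndexOf('/') + 1);        // "2009-07-30.log"
            String stamp = name.substring(0, name.lastIndexOf('.'));  // "2009-07-30"
            LocalDate d = LocalDate.parse(stamp);
            if (!d.isBefore(start) && !d.isAfter(end)) {
                keep.add(p);
            }
        }
        return keep;
    }
}
```

The surviving paths can then be handed to the job as its input list.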

On Thu, Jul 30, 2009 at 7:28 AM, David_ca <davidsuperca@gmail.com> wrote:

> Hi,
> This question refers to a thread that was asked back in June.
> http://www.mail-archive.com/core-user@hadoop.apache.org/msg10490.html
> I would like to do a similar thing. I have logs in a similar format to:
> /logs/<hostname>/<date>.log and I would like to selectively choose which
> logs to process in a date range.
> First I tried the approach suggested by Brian: writing a subroutine in the
> driver that descends through the file system starting at /logs and builds a
> list of input files.
> http://www.mail-archive.com/core-user@hadoop.apache.org/msg10492.html
> This approach did not work for me when I tried to use inputs from S3; it kept
> failing with java.lang.IllegalArgumentException: Wrong FS.
> Then I tried the second suggested approach: writing a custom InputFormat that
> recursively traverses directories for files. This approach worked for S3
> inputs.
> But I would like to pass two dates to my InputFormat so that it can use
> them as a
> date range to filter out files.
> I got stuck here because I couldn't figure out how to pass date parameters
> to the InputFormat.
> In my driver, I set the InputFormat as follows:
> conf.setInputFormat(FilterFileTextInputFormat.class);
> Any ideas on how I can get either approach to work?
> thanks,
> David
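On the last question, the usual trick with the old mapred API is to stash the dates in the JobConf as strings in the driver (conf.set(...)) and read them back with job.get(...) inside the custom InputFormat's getSplits(JobConf, int), since that method receives the same configuration object. A minimal stand-alone sketch of that key/value round trip, using java.util.Properties in place of JobConf so it runs on its own (the property names are made up):

```java
import java.util.Properties;

public class DatePassing {
    // In Hadoop's old API the driver would call conf.set(...) on the JobConf;
    // Properties stands in for JobConf here so the sketch is self-contained.
    static Properties driverSide() {
        Properties conf = new Properties();
        conf.setProperty("log.date.start", "2009-06-01"); // hypothetical key names
        conf.setProperty("log.date.end", "2009-06-30");
        return conf;
    }

    // Inside a custom InputFormat, getSplits(JobConf job, int numSplits) is
    // handed the same configuration, so the dates can be read back by key
    // and used to filter the files found while traversing the directories.
    static String[] inputFormatSide(Properties job) {
        return new String[] {
            job.getProperty("log.date.start"),
            job.getProperty("log.date.end")
        };
    }
}
```

As for the Wrong FS error in the first approach: FileSystem.get(conf) returns the default filesystem, so calling it while walking S3 paths trips the check; path.getFileSystem(conf) returns the filesystem matching the path's own scheme and should avoid the exception.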
