hadoop-common-user mailing list archives

From maha <m...@umail.ucsb.edu>
Subject Re: repeat a job for different files
Date Thu, 18 Nov 2010 18:49:46 GMT
Hi Alex, 

  Thanks for the reply. That is in fact what I did, but the problem then is that I have to
write the MR code three times!

 If I keep only one MR job and pass it one file at a time, the problem is in the
FileOutputFormat, because it will complain that the output directory already exists!

  So I had to repeat the same MR code three times, differing only in the output directory,
which is unreasonable; what if I have 100 files? :(

  I guess another solution would be to find out HOW to write an output file into an existing
directory instead of creating a new one (using FileOutputFormat)?
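Alex's suggestion below (run one MR job per file, giving each job its own output path) avoids the "output directory already exists" error. Here is a minimal sketch of such a driver loop in plain Java; the class name PerFileDriver, the input/ and output/ paths, and the job-submission step (shown only as a comment) are my own assumptions, not code from this thread:

```java
import java.io.File;

public class PerFileDriver {

    // Derive a distinct output directory for each input file,
    // e.g. "file1.txt" -> "output/file1". Strips the last extension.
    static String outputDirFor(String inputFileName) {
        String base = inputFileName.replaceFirst("\\.[^.]+$", "");
        return "output/" + base;
    }

    public static void main(String[] args) {
        File inputDir = new File("input");
        File[] files = inputDir.listFiles();
        if (files == null) {
            System.err.println("input/ does not exist or is not a directory");
            return;
        }
        for (File f : files) {
            String outDir = outputDirFor(f.getName());
            // Hypothetical submission step: configure one Job here, then
            //   FileInputFormat.addInputPath(job, new Path(f.getPath()));
            //   FileOutputFormat.setOutputPath(job, new Path(outDir));
            //   job.waitForCompletion(true);
            System.out.println(f.getName() + " -> " + outDir);
        }
    }
}
```

Because each iteration sets a fresh, not-yet-existing output directory, FileOutputFormat never rejects the job, and the same MR code is reused for all N files.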


On Nov 17, 2010, at 10:11 PM, Alex Baranau wrote:

> In case you need to process the files separately, use one MR job for each
> file. You can add a single file as input. I believe you'll need to iterate
> over all files in the input dir and start a job instance for each file. You can
> do this in Java code or in a script or... depending on your case.
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> On Wed, Nov 17, 2010 at 10:36 PM, maha <maha@umail.ucsb.edu> wrote:
>> Hi,
>>  When I set my inputFileFormat to take an input directory with three files
>> in it, the job processes all three and produces one output containing the
>> results from all of them.
>> Instead I want the job to be repeated separately for each inputFile and
>> hence a different output.
>>   E.g. wordCount(input), where input/ contains file1.txt, file2.txt, ..., fileN.txt
>>       This happens: output/outputFile.txt contains all words from all the
>> files along with their counts.
>>       I want: output/outputFile1.txt for file1.txt's words, ........ ,
>> outputFileN.txt for fileN.txt
>>      Thanks,
>>          Maha
