hadoop-mapreduce-user mailing list archives

From rab ra <rab...@gmail.com>
Subject Re: Hadoop InputFormat - Processing large number of small files
Date Wed, 20 Aug 2014 15:19:25 GMT
Thanks for the response.

Yes, I know about WholeFileInputFormat, but I am not sure whether the filename
reaches the map process as either the key or the value. As I understand it, that
format reads the contents of the file. I would like an InputFormat that yields
just the filename, or a list of filenames.

Also, the files are very small. WholeFileInputFormat spawns one map process
per file and thus results in a huge number of map processes. I would like to
spawn a single map process per group of files.

I think I need to tweak CombineFileInputFormat's RecordReader so that it
does not read the entire file but emits just the filename.
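Something like the following sketch is what I have in mind (untested, and the
class names FileNameInputFormat / FileNameRecordReader are my own invention):
a CombineFileInputFormat subclass whose RecordReader walks the paths inside a
CombineFileSplit and emits each path as the key without ever opening the file.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Groups many small files into one split; the reader emits each file's
// path as the key and never performs any file I/O.
public class FileNameInputFormat extends CombineFileInputFormat<Text, NullWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // each file stays whole inside its combined split
    }

    @Override
    public RecordReader<Text, NullWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new FileNameRecordReader();
    }

    public static class FileNameRecordReader extends RecordReader<Text, NullWritable> {
        private CombineFileSplit split;
        private int index = -1;
        private final Text key = new Text();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context) {
            this.split = (CombineFileSplit) genericSplit;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            index++;
            if (index >= split.getNumPaths()) {
                return false;
            }
            // Filename only -- the map task opens the file itself if needed.
            key.set(split.getPath(index).toString());
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public NullWritable getCurrentValue() { return NullWritable.get(); }

        @Override
        public float getProgress() {
            int n = split.getNumPaths();
            return n == 0 ? 1.0f : (float) (index + 1) / n;
        }

        @Override
        public void close() {}
    }
}
```

If I understand the docs correctly, how many files land in one map is then
controlled by the max split size (e.g. the
mapreduce.input.fileinputformat.split.maxsize property), and
CombineFileInputFormat already tries to build splits from blocks on the same
node, which should also help with the locality question below.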



On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yunus@gmail.com> wrote:

> Have you looked at the WholeFileInputFormat implementations? There are
> quite a few if you search for them...
> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
> Regards,
> Shahab
> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rabmdu@gmail.com> wrote:
>> Hello,
>> I have a use case wherein I need to process a huge set of files stored in
>> HDFS. The files are non-splittable and need to be processed as a whole.
>> I have the following questions, for which I need answers in order to
>> proceed.
>> 1.  I wish to schedule the map process on the task tracker where the data
>> is already available. How can I do that? Currently, I have a file that
>> contains a list of filenames. Each map gets one line of it via
>> NLineInputFormat. The map process then accesses the file via
>> FSDataInputStream and works with it. Is there a way to ensure this map
>> process runs on the node where the file is available?
>> 2.  Since the files are not large, they can be called 'small' files by
>> Hadoop standards. I came across CombineFileInputFormat, which can process
>> more than one file in a single map process. What I need here is a format
>> that can process more than one file in a single map but does not have to
>> read the files, and carries the filenames in either the key or the value.
>> In the map process, I can then run a loop to process these files. Any help?
>> 3. Any other alternatives?
>> regards
>>  rab
