hadoop-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: Hadoop InputFormat - Processing large number of small files
Date Wed, 20 Aug 2014 13:18:52 GMT
Have you looked at the WholeFileInputFormat implementations? There are
quite a few if you search for them, for example:

http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
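
If it helps, here is roughly what those implementations boil down to, as a
minimal, untested sketch against the new mapreduce API (adapt it to your
Hadoop version). The format refuses to split files, and the record reader
hands each map the whole file content as a single record:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One record per file: the key is empty, the value is the entire file content.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split a file across map tasks
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();  // framework calls initialize()
  }

  static class WholeFileRecordReader
      extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.fileSplit = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      // Read the entire file into the value in one go.
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
  }
}

Since this is still a FileInputFormat, each FileSplit carries the file's
block locations, so the scheduler will try to place the map task on a node
holding the data, which also goes some way towards your question 1.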

Regards,
Shahab


On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rabmdu@gmail.com> wrote:

> Hello,
>
> I have a use case wherein I need to process a huge set of files stored in
> HDFS. The files are non-splittable and must each be processed as a whole.
> I have the following questions, to which I need answers before I can
> proceed:
>
> 1.  I wish to schedule each map task on a tasktracker where its data is
> already available. How can I do that? Currently, I have a file containing
> a list of filenames. Each map gets one line of it via NLineInputFormat,
> then opens that file via FSDataInputStream and works with it. Is there a
> way to ensure the map task runs on the node where the file is stored?
>
> 2.  The files are not large and would be called 'small' files by Hadoop
> standards. I came across CombineFileInputFormat, which can process more
> than one file in a single map task. What I need is a format that handles
> more than one file per map but does not read the files itself, passing
> the filenames in either the key or the value. In the map task, I can then
> loop over those files myself. Any help?
>
> 3. Any other alternatives?
>
>
>
> regards
> rab
>
>
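
On questions 1 and 2 above: a CombineFileInputFormat subclass may be the
closest fit. It packs many small files into each split, groups blocks by
node and rack when building splits (so the scheduler can honour data
locality), and you can plug in a record reader that emits only the file
paths without ever opening the files. A rough, untested sketch using the
new mapreduce API; the class names are mine:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Emits one record per file in the combined split: key is the file path,
// value is empty. The file contents are never read here.
public class FileNameInputFormat
    extends CombineFileInputFormat<Text, NullWritable> {

  public FileNameInputFormat() {
    // Cap the split size so the job gets more than one map task; tune this.
    setMaxSplitSize(128 * 1024 * 1024);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // keep each file whole within a split
  }

  @Override
  public RecordReader<Text, NullWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    return new FileNameRecordReader();
  }

  static class FileNameRecordReader
      extends RecordReader<Text, NullWritable> {

    private CombineFileSplit split;
    private int index = -1;
    private final Text key = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.split = (CombineFileSplit) split;
    }

    @Override
    public boolean nextKeyValue() {
      index++;
      if (index >= split.getNumPaths()) {
        return false;
      }
      key.set(split.getPath(index).toString());
      return true;
    }

    @Override
    public Text getCurrentKey() { return key; }

    @Override
    public NullWritable getCurrentValue() { return NullWritable.get(); }

    @Override
    public float getProgress() {
      return split.getNumPaths() == 0
          ? 1.0f : Math.min(1.0f, (float) (index + 1) / split.getNumPaths());
    }

    @Override
    public void close() { }
  }
}

Each map then gets one path per record and can open the file itself via
FileSystem, so the framework never materializes the contents. Without the
cap on the split size in the constructor, CombineFileInputFormat can pack
everything into a single split and you would end up with just one map task.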
