hadoop-common-user mailing list archives

From rab ra <rab...@gmail.com>
Subject Hadoop InputFormat - Processing large number of small files
Date Wed, 20 Aug 2014 05:46:38 GMT

I have a use case in which I need to process a large number of files stored
in HDFS. The files are not splittable and each one must be processed as a
whole. I have the following questions, which I need answered before I can
proceed.

1.  I want each map task to be scheduled on a TaskTracker node where its
data is already available. How can I do that? Currently, I have a file that
contains a list of filenames. Each map gets one line of it via
NLineInputFormat. The map task then opens the named file via
FSDataInputStream and works with it. Is there a way to ensure that the map
task runs on the node where the file is stored?

2.  The files are not large; they count as 'small' files by Hadoop
standards. I came across CombineFileInputFormat, which can process more than
one file in a single map task. What I need is an input format that assigns
more than one file to a single map but does not itself read the files; it
should only pass the filenames in either the key or the value. In the map
task, I can then loop over these files and process them. Any help?

3.  Any other alternatives?
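
To make point 2 concrete, here is a rough plain-Java sketch of the grouping I
have in mind. It is not Hadoop API code: the class FileNameGrouper, its Group
type and the sample paths are hypothetical names used only for illustration.
It also assumes each file has already been mapped to a single preferred host
(in a real job that would probably come from the block locations reported by
HDFS, e.g. FileSystem.getFileBlockLocations). Each resulting group would
correspond to one map task that receives only filenames and opens the files
itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: packs file names into per-host groups.
public class FileNameGrouper {

    /** One group stands for one map task: a preferred host plus the file names to process there. */
    public static final class Group {
        public final String host;
        public final List<String> fileNames;

        Group(String host, List<String> fileNames) {
            this.host = host;
            this.fileNames = fileNames;
        }
    }

    /**
     * Groups file names by their (assumed, single) preferred host and cuts each
     * host's list into chunks of at most maxFilesPerGroup names.
     */
    public static List<Group> group(Map<String, String> fileToHost, int maxFilesPerGroup) {
        if (maxFilesPerGroup < 1) {
            throw new IllegalArgumentException("maxFilesPerGroup must be >= 1");
        }

        // Bucket file names by host, keeping the order in which hosts first appear.
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fileToHost.entrySet()) {
            byHost.computeIfAbsent(e.getValue(), h -> new ArrayList<>()).add(e.getKey());
        }

        // Cut each bucket into fixed-size chunks; every chunk becomes one group.
        List<Group> groups = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byHost.entrySet()) {
            List<String> files = e.getValue();
            for (int i = 0; i < files.size(); i += maxFilesPerGroup) {
                int end = Math.min(i + maxFilesPerGroup, files.size());
                groups.add(new Group(e.getKey(), new ArrayList<>(files.subList(i, end))));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sample data: file name -> host assumed to hold it.
        Map<String, String> fileToHost = new LinkedHashMap<>();
        fileToHost.put("/data/a.bin", "node1");
        fileToHost.put("/data/b.bin", "node2");
        fileToHost.put("/data/c.bin", "node1");
        fileToHost.put("/data/d.bin", "node1");

        for (Group g : group(fileToHost, 2)) {
            System.out.println(g.host + " -> " + g.fileNames);
        }
    }
}
```

In this sketch the map side would simply loop over fileNames and open each
file itself, which is the behaviour I am asking for; what I do not know is
which input format (if any) gives me such groups together with locality.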

