hadoop-common-user mailing list archives

From: Shahab Yunus <shahab.yu...@gmail.com>
Subject: Re: Hadoop InputFormat - Processing large number of small files
Date: Wed, 20 Aug 2014 13:18:52 GMT
Have you looked at the WholeFileInputFormat implementations? There are
quite a few if you search for them...
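
For reference, a minimal sketch of that pattern (untested, and the class
names here are only illustrative): subclass FileInputFormat, mark every
file as non-splittable, and have the record reader deliver each file as a
single record.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Reads each input file as one record: key = file path, value = raw bytes.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split; each file goes to exactly one mapper
  }

  @Override
  public RecordReader<Text, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  public static class WholeFileRecordReader
      extends RecordReader<Text, BytesWritable> {
    private FileSplit split;
    private Configuration conf;
    private final Text key = new Text();
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.split = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      // Read the whole file into memory; fine for genuinely small files.
      Path file = split.getPath();
      byte[] contents = new byte[(int) split.getLength()];
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = fs.open(file);
      try {
        IOUtils.readFully(in, contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      key.set(file.toString());
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}

A side benefit: the splits built by FileInputFormat carry the block
locations of each file, so the scheduler will try to run each map on a node
that holds the data, which is relevant to question 1 below.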



On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rabmdu@gmail.com> wrote:

> Hello,
> I have a use case wherein I need to process a huge set of files stored in
> HDFS. The files are non-splittable and must each be processed as a whole.
> I have the following questions, which I need answered before I can
> proceed:
> 1. I wish to schedule each map task on the TaskTracker where its data is
> already available. How can I do that? Currently, I have a file containing
> a list of filenames, and each map task gets one line of it via
> NLineInputFormat. The map task then opens the named file via
> FSDataInputStream and works with it. Is there a way to ensure the map task
> runs on the node where that file's data resides?
> 2. The files are not large; by Hadoop standards they would be called
> 'small' files. I came across CombineFileInputFormat, which can process
> more than one file in a single map task. What I need here is a format that
> hands several files to a single map task without reading their contents,
> and that carries the filenames in either the key or the value. In the map
> task I can then loop over these files and process them. Any help? (See the
> sketch following this message.)
> 3. Any other alternatives?
> regards
> rab
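
For question 2 above, one possible approach (a sketch only, untested;
FileNameInputFormat and FileNameRecordReader are names made up here) is a
CombineFileInputFormat subclass whose per-file record reader emits just the
file path and never opens the file. CombineFileInputFormat also tries to
assemble each combined split from blocks that sit on the same node or rack,
which speaks to the locality concern in question 1.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Packs many small files into each split and emits one record per file:
// key = file path, value = nothing. The mapper opens the files itself.
public class FileNameInputFormat
    extends CombineFileInputFormat<Text, NullWritable> {

  public FileNameInputFormat() {
    setMaxSplitSize(128L * 1024 * 1024);  // tune: max bytes of files per map
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // the files are non-splittable
  }

  @Override
  public RecordReader<Text, NullWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    return new CombineFileRecordReader<Text, NullWritable>(
        (CombineFileSplit) split, context, FileNameRecordReader.class);
  }

  // One instance is created per file in the combined split; the
  // (CombineFileSplit, TaskAttemptContext, Integer) constructor is the
  // signature CombineFileRecordReader requires.
  public static class FileNameRecordReader
      extends RecordReader<Text, NullWritable> {
    private final Text key = new Text();
    private boolean emitted = false;

    public FileNameRecordReader(CombineFileSplit split,
        TaskAttemptContext context, Integer index) {
      key.set(split.getPath(index).toString());
    }

    @Override public void initialize(InputSplit split,
        TaskAttemptContext context) { }

    @Override
    public boolean nextKeyValue() {
      if (emitted) {
        return false;
      }
      emitted = true;  // exactly one record: this file's path
      return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
    @Override public float getProgress() { return emitted ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}

Each map task then receives one (path, NullWritable) pair per file and can
open the paths with FileSystem.open() in a loop, as the question describes.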
