hadoop-hdfs-user mailing list archives

From Felix Chern <idry...@gmail.com>
Subject Re: Hadoop InputFormat - Processing large number of small files
Date Wed, 20 Aug 2014 17:28:25 GMT
I wrote a post on how to use CombineFileInputFormat:
http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
In the RecordReader constructor, you can find out which file you are reading from.
In my example, I created FileLineWritable to include the filename in the mapper input key.
Then you can use the input key in the mapper like this:

  
  public static class TestMapper
      extends Mapper<FileLineWritable, Text, Text, IntWritable> {

    private final Text txt = new Text();
    private final IntWritable count = new IntWritable(1);

    @Override
    public void map(FileLineWritable key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer st = new StringTokenizer(val.toString());
      while (st.hasMoreTokens()) {
        // The key carries the source filename, so counts are kept per file.
        txt.set(key.fileName + st.nextToken());
        context.write(txt, count);
      }
    }
  }
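
The piece that makes this work is the custom RecordReader handed to CombineFileRecordReader: one reader instance is created per file in the combined split, and its constructor receives that file's index. Roughly like this (a sketch, not the exact code from the post; it assumes FileLineWritable has a no-arg constructor plus the public fileName field the mapper above uses, and it delegates the actual line reading to the stock LineRecordReader):

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;
  import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

  // One instance per file in the combined split; CombineFileRecordReader
  // looks for exactly this three-argument constructor via reflection.
  public class FileLineRecordReader extends RecordReader<FileLineWritable, Text> {

    private final Path path;            // the file this reader instance covers
    private final FileSplit fileSplit;  // single-file view of the combined split
    private final LineRecordReader delegate = new LineRecordReader();
    private final FileLineWritable key = new FileLineWritable();

    public FileLineRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                Integer index) {
      this.path = split.getPath(index);
      this.fileSplit = new FileSplit(path, split.getOffset(index),
                                     split.getLength(index), (String[]) null);
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
      delegate.initialize(fileSplit, context);   // delegate reads the lines
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      return delegate.nextKeyValue();
    }

    @Override
    public FileLineWritable getCurrentKey() {
      key.fileName = path.getName();             // the field the mapper reads
      return key;
    }

    @Override
    public Text getCurrentValue() {
      return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }

The matching InputFormat just returns new CombineFileRecordReader<FileLineWritable, Text>((CombineFileSplit) split, context, FileLineRecordReader.class) from createRecordReader(); the post has the full version.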


Cheers,
Felix


On Aug 20, 2014, at 8:19 AM, rab ra <rabmdu@gmail.com> wrote:

> Thanks for the response.
> 
> Yes, I know WholeFileInputFormat. But I am not sure whether the filename reaches the map
> process as the key or the value. And as I understand it, this format reads the entire contents
> of each file. I wish to have an InputFormat that just gives the filename, or a list of filenames.
> 
> Also, the files are very small. WholeFileInputFormat spawns one map process per file and
> thus results in a huge number of map processes. I wish to spawn a single map process per
> group of files.
> 
> I think I need to tweak CombineFileInputFormat's RecordReader so that it does not read
> the entire file but just supplies the filename.
> 
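A reader along those lines might look like this (a sketch, untested; FileNameInputFormat and FileNameRecordReader are made-up names). It emits exactly one record per file, the filename itself, and never opens the file, so the mapper can open the files it is handed and process each one whole:

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

  // Groups many small files into each split, keeps the files themselves
  // whole, and hands the mapper one filename per record.
  public class FileNameInputFormat extends CombineFileInputFormat<Text, NullWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      return false;   // never split the (small) files themselves
    }

    @Override
    public RecordReader<Text, NullWritable> createRecordReader(
        InputSplit split, TaskAttemptContext context) throws IOException {
      return new CombineFileRecordReader<Text, NullWritable>(
          (CombineFileSplit) split, context, FileNameRecordReader.class);
    }

    // Emits a single record per file -- the filename as the key -- without
    // ever opening the file. One instance is created per file in the split.
    public static class FileNameRecordReader extends RecordReader<Text, NullWritable> {

      private final Text fileName = new Text();
      private boolean emitted = false;

      public FileNameRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                  Integer index) {
        // `index` says which of the combined files this instance covers.
        fileName.set(split.getPath(index).toString());
      }

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context) { }

      @Override
      public boolean nextKeyValue() {
        if (emitted) return false;   // exactly one record per file
        emitted = true;
        return true;
      }

      @Override
      public Text getCurrentKey() { return fileName; }

      @Override
      public NullWritable getCurrentValue() { return NullWritable.get(); }

      @Override
      public float getProgress() { return emitted ? 1.0f : 0.0f; }

      @Override
      public void close() { }
    }
  }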
> 
> regards
> rab
> 
> 
> 
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yunus@gmail.com> wrote:
> Have you looked at the WholeFileInputFormat implementations? There are quite a few if
> you search for them...
> 
> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
> 
> Regards,
> Shahab
> 
> 
> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rabmdu@gmail.com> wrote:
> Hello,
> 
> I have a use case wherein I need to process a huge set of files stored in HDFS. The files
> are non-splittable and need to be processed as a whole. I have the following questions, and
> I need answers to them to proceed further.
> 
> 1. I wish to schedule each map process on a task tracker where its data is already available.
> How can I do that? Currently, I have a file that contains a list of filenames. Each map gets
> one line of it via NLineInputFormat. The map process then accesses its file via FSDataInputStream
> and works with it. Is there a way to ensure this map process runs on the node where the
> file is available?
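
With plain NLineInputFormat the splits carry no locality information for the files named inside them, so the scheduler cannot place the maps next to the data by itself. One way around that, sketched below with made-up names, is to build the splits by hand and report each file's block hosts through the split's locations, which is what the scheduler uses for placement:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;

  public class LocalityHints {
    // A map task is scheduled using InputSplit.getLocations(). Splits built
    // by hand can report the hosts that actually hold each file's blocks,
    // so the task will prefer to run on one of those nodes.
    public static List<InputSplit> splitsWithLocality(JobContext job, List<Path> files)
        throws IOException {
      List<InputSplit> splits = new ArrayList<InputSplit>();
      FileSystem fs = FileSystem.get(job.getConfiguration());
      for (Path file : files) {
        FileStatus stat = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
        // A small file normally fits in one block; use its hosts as the hint.
        splits.add(new FileSplit(file, 0, stat.getLen(), blocks[0].getHosts()));
      }
      return splits;
    }
  }

For what it's worth, CombineFileInputFormat already tries to group files node- and rack-locally when it builds its combined splits, which covers much of this in the combined case.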
> 
> 2. The files are not large; by Hadoop standards they would be called 'small' files. I came
> across CombineFileInputFormat, which can process more than one file in a single map process.
> What I need here is a format that can process more than one file per map but does not have
> to read the files; either the key or the value should carry the filenames. In the map process,
> I can then run a loop over these files. Any help?
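
Wiring an input format like the FileNameInputFormat sketched earlier into a job driver might look like this (hypothetical class name and paths; the max-split-size setting caps how many bytes' worth of small files get grouped into one map):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "process-small-files");
      job.setJarByClass(SmallFilesDriver.class);
      // One split covers a whole batch of small files; each map() call
      // then receives a single filename as its key.
      job.setInputFormatClass(FileNameInputFormat.class);
      // Cap the bytes combined into one split so maps stay a sensible size.
      FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      // set mapper class and output key/value types here as usual
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }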
> 
> 3. Any other alternatives?
> 
> 
> 
> regards
> rab
> 
> 
> 

