hadoop-hdfs-user mailing list archives

From Felix Chern <idry...@gmail.com>
Subject Re: Hadoop InputFormat - Processing large number of small files
Date Thu, 21 Aug 2014 15:37:35 GMT
If I were you, I'd first generate a file listing those file names:

hadoop fs -ls > term_file

Then run a normal MapReduce job over that list.
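
For instance, a minimal driver sketch of that idea, assuming term_file holds one
HDFS path per line and using NLineInputFormat so that each map task gets exactly
one file name (the class names here are illustrative, not from any actual job):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class FileListDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "process file list");
      job.setJarByClass(FileListDriver.class);
      // Each map task receives one line of term_file, i.e. one file name.
      // (hadoop fs -ls prints extra columns, so the listing may first need
      // trimming down to just the path column.)
      job.setInputFormatClass(NLineInputFormat.class);
      NLineInputFormat.setNumLinesPerSplit(job, 1);
      NLineInputFormat.addInputPath(job, new Path("term_file"));
      job.setMapperClass(FileNameMapper.class); // hypothetical; see the FileNameMapper sketch further down the thread
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileOutputFormat.setOutputPath(job, new Path(args[0]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }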

Felix

On Aug 21, 2014, at 1:38 AM, rab ra <rabmdu@gmail.com> wrote:

> Thanks for the link. If CombineFileInputFormat does not need to pass the files' contents to the map process, only the file names, what changes need to be made in the code?
> 
> rab.
> 
> On 20 Aug 2014 22:59, "Felix Chern" <idryman@gmail.com> wrote:
> I wrote a post on how to use CombineInputFormat:
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
> In the RecordReader constructor, you can get the context of which file you are reading in.
> In my example, I created FileLineWritable to include the filename in the mapper input key.
> Then you can use the input key as:
> 
> 
>   public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>     private Text txt = new Text();
>     private IntWritable count = new IntWritable(1);
> 
>     public void map(FileLineWritable key, Text val, Context context)
>         throws IOException, InterruptedException {
>       // Tokenize the line and prefix each token with the name of the file
>       // it came from, so counts are kept per source file.
>       StringTokenizer st = new StringTokenizer(val.toString());
>       while (st.hasMoreTokens()) {
>         txt.set(key.fileName + st.nextToken());
>         context.write(txt, count);
>       }
>     }
>   }
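> 
> For completeness, a minimal sketch of what FileLineWritable could look like; the
> fileName field matches the key.fileName access above, while the offset field and
> the Writable plumbing are assumptions, not copied from the post:
> 
>   import java.io.DataInput;
>   import java.io.DataOutput;
>   import java.io.IOException;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.io.WritableComparable;
> 
>   public class FileLineWritable implements WritableComparable<FileLineWritable> {
>     public long offset;     // position of the line within its file (assumed field)
>     public String fileName; // read by the mapper as key.fileName
> 
>     public void write(DataOutput out) throws IOException {
>       out.writeLong(offset);
>       Text.writeString(out, fileName);
>     }
> 
>     public void readFields(DataInput in) throws IOException {
>       offset = in.readLong();
>       fileName = Text.readString(in);
>     }
> 
>     public int compareTo(FileLineWritable that) {
>       int cmp = this.fileName.compareTo(that.fileName);
>       return cmp != 0 ? cmp : Long.compare(this.offset, that.offset);
>     }
> 
>     @Override
>     public int hashCode() { // keep consistent with compareTo for partitioning
>       return fileName.hashCode() * 163 + (int) offset;
>     }
> 
>     @Override
>     public boolean equals(Object o) {
>       return o instanceof FileLineWritable && compareTo((FileLineWritable) o) == 0;
>     }
>   }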
> 
> 
> Cheers,
> Felix
> 
> 
> On Aug 20, 2014, at 8:19 AM, rab ra <rabmdu@gmail.com> wrote:
> 
>> Thanks for the response.
>> 
>> Yes, I know WholeFileInputFormat, but I am not sure whether the filename comes to the map process as the key or the value. I think this input format reads the contents of the file, whereas I wish to have an InputFormat that just gives the filename, or a list of filenames.
>> 
>> Also, the files are very small. WholeFileInputFormat spawns one map process per file and thus results in a huge number of map processes. I wish to spawn a single map process per group of files.
>> 
>> I think I need to tweak CombineFileInputFormat's RecordReader so that it does not read the entire file but yields just the filename.
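>> 
>> Something along these lines is what I have in mind: a sketch only, with
>> illustrative class names, assuming the new-API CombineFileInputFormat; the
>> reader emits each file's path once as the key and never opens the file:
>> 
>>   import java.io.IOException;
>>   import org.apache.hadoop.fs.Path;
>>   import org.apache.hadoop.io.NullWritable;
>>   import org.apache.hadoop.io.Text;
>>   import org.apache.hadoop.mapreduce.InputSplit;
>>   import org.apache.hadoop.mapreduce.RecordReader;
>>   import org.apache.hadoop.mapreduce.TaskAttemptContext;
>>   import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
>>   import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
>>   import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
>> 
>>   // Groups many small files into each split; the reader below only reports
>>   // each file's path and never opens the file itself.
>>   public class FileNameInputFormat extends CombineFileInputFormat<Text, NullWritable> {
>>     public FileNameInputFormat() {
>>       setMaxSplitSize(64 * 1024 * 1024); // group files up to ~64 MB per split
>>     }
>> 
>>     @Override
>>     public RecordReader<Text, NullWritable> createRecordReader(
>>         InputSplit split, TaskAttemptContext context) throws IOException {
>>       return new CombineFileRecordReader<Text, NullWritable>(
>>           (CombineFileSplit) split, context, FileNameRecordReader.class);
>>     }
>> 
>>     public static class FileNameRecordReader extends RecordReader<Text, NullWritable> {
>>       private final Path path;
>>       private boolean done = false;
>> 
>>       // CombineFileRecordReader requires this exact constructor signature.
>>       public FileNameRecordReader(CombineFileSplit split, TaskAttemptContext context,
>>           Integer index) {
>>         this.path = split.getPath(index);
>>       }
>> 
>>       @Override public void initialize(InputSplit split, TaskAttemptContext context) { }
>> 
>>       @Override public boolean nextKeyValue() {
>>         if (done) return false;  // emit exactly one record per file
>>         done = true;
>>         return true;
>>       }
>> 
>>       @Override public Text getCurrentKey() { return new Text(path.toString()); }
>>       @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
>>       @Override public float getProgress() { return done ? 1.0f : 0.0f; }
>>       @Override public void close() { }
>>     }
>>   }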
>> 
>> 
>> regards
>> rab
>> 
>> regards
>> Bala
>> 
>> 
>> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yunus@gmail.com> wrote:
>> Have you looked at the WholeFileInputFormat implementations? There are quite a few if you search for them...
>> 
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>> 
>> Regards,
>> Shahab
>> 
>> 
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rabmdu@gmail.com> wrote:
>> Hello,
>> 
>> I have a use case wherein I need to process a huge set of files stored in HDFS. The files are non-splittable and each needs to be processed as a whole. I have the following questions, whose answers I need in order to proceed:
>> 
>> 1. I wish to schedule each map process on a task tracker where its data is already available. How can I do that? Currently, I have a file that contains a list of filenames; each map gets one line of it via NLineInputFormat, then accesses the file via FSDataInputStream and works with it. Is there a way to ensure that a map process runs on the node where its file is available?
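>> 
>> For concreteness, the mapper described above might look roughly like this; the
>> class name and the per-file work are placeholders, and it assumes each input
>> line holds exactly one HDFS path:
>> 
>>   import java.io.IOException;
>>   import org.apache.hadoop.fs.FSDataInputStream;
>>   import org.apache.hadoop.fs.FileSystem;
>>   import org.apache.hadoop.fs.Path;
>>   import org.apache.hadoop.io.IntWritable;
>>   import org.apache.hadoop.io.LongWritable;
>>   import org.apache.hadoop.io.Text;
>>   import org.apache.hadoop.mapreduce.Mapper;
>> 
>>   // With NLineInputFormat, each map() call receives one line of the list file:
>>   // key = byte offset of the line, value = the line itself (a file name).
>>   public class FileNameMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
>>     @Override
>>     protected void map(LongWritable offset, Text line, Context context)
>>         throws IOException, InterruptedException {
>>       Path path = new Path(line.toString().trim());
>>       FileSystem fs = path.getFileSystem(context.getConfiguration());
>>       try (FSDataInputStream in = fs.open(path)) {
>>         // Placeholder per-file processing: count how many bytes fit in one read.
>>         byte[] buf = new byte[4096];
>>         int n = in.read(buf);
>>         context.write(new Text(path.getName()), new IntWritable(Math.max(n, 0)));
>>       }
>>     }
>>   }
>> 
>> Note that NLineInputFormat computes splits (and hence locality) from the list
>> file itself, not from the files named in it, so the scheduler has no locality
>> hints for those files; that is exactly the gap this question describes.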
>> 
>> 2. The files are not large; by Hadoop standards they would be called 'small' files. I came across CombineFileInputFormat, which can process more than one file in a single map process. What I need here is a format that packs more than one file into a single map but does not have to read the files, passing the filenames in either the key or the value. In the map process I can then run a loop to process those files. Any help?
>> 
>> 3. Any other alternatives?
>> 
>> 
>> 
>> regards
>> rab

