hadoop-mapreduce-user mailing list archives

From Bibek Paudel <eternalyo...@gmail.com>
Subject Re: Parallelize a workflow using mapReduce
Date Thu, 23 Jun 2011 13:25:45 GMT

> By the way, how will the split be done? I mean, will the input be split by HDFS
> block? Will I have 1 map task per HDFS block?

The default behavior is to split the file based on the HDFS block
size, but this depends on the InputFormat, and you can write your
own InputFormat to create splits of the size and nature you want.
Many InputFormats have already been written by others, so please have
a look; examples include splits of N lines (NLineInputFormat), one
split per file, and so on.
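To make the default concrete, here is a standalone sketch (not actual Hadoop code; the class and method names are illustrative) of the arithmetic FileInputFormat uses to size splits: the split size is clamped between a configured minimum and maximum, and with the defaults it equals the HDFS block size, so the number of map tasks is roughly the file size divided by the block size, rounded up.

```java
// Illustrative sketch of FileInputFormat's default split math.
public class SplitMath {

    // Split size = max(minSize, min(maxSize, blockSize)).
    // With the defaults (minSize = 1, maxSize = Long.MAX_VALUE),
    // this is simply the HDFS block size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // One map task per split: ceil(fileSize / splitSize).
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB block size
        long fileSize  = 200L * 1024 * 1024;  // 200 MB input file
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // 200 MB / 64 MB rounds up to 4 splits, hence 4 mappers.
        System.out.println(numSplits(fileSize, splitSize));
    }
}
```

Raising the minimum split size above the block size is one way to get fewer, larger splits without touching HDFS itself.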

Yes, the default behavior is to have one mapper per input split, but
again, this can be overridden by a custom InputFormat: for example, if
you tell the InputFormat not to split a file, the whole file goes to a
single mapper even when it is bigger than the block size.
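In real Hadoop code you would subclass FileInputFormat and override isSplitable() to return false; the standalone sketch below (illustrative names, not Hadoop code) just shows the effect that has on the mapper count:

```java
// Sketch of what a non-splittable file means for the mapper count.
public class NonSplittable {

    // If the format declares the file non-splittable, the whole file
    // becomes a single split, and therefore a single map task,
    // regardless of how many HDFS blocks it spans.
    static long numSplits(long fileSize, long blockSize, boolean splittable) {
        if (!splittable) {
            return 1;
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB block size
        long fileSize  = 200L * 1024 * 1024;  // file spans 4 blocks
        System.out.println(numSplits(fileSize, blockSize, true));   // splittable
        System.out.println(numSplits(fileSize, blockSize, false));  // not splittable
    }
}
```

This is the usual approach for formats that cannot be read from an arbitrary byte offset, such as gzip-compressed files.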

> Will this workflow benefit
> from Hadoop data locality optimization?

I did not understand this question.


>> I hope I understood your problem properly, and my suggestion is the
>> kind you were looking for.
> Thanks,
> Hassen
>> Bibek
