From Hassen Riahi <hassen.ri...@cern.ch>
Subject Re: Parallelize a workflow using mapReduce
Date Thu, 23 Jun 2011 14:12:28 GMT

> Hi,
>> By the way, how will the split be done? I mean, will the input be
>> split by HDFS block? Will I have 1 map task per HDFS block?
> The default behavior is to split the file based on the HDFS block
> size, but this depends on the InputFormat and you can also write your
> own InputFormat to create a split of the size/nature that you want.
> There are already many InputFormats that other people have written
> too, please have a look, examples include: splits of N lines,
> one-split per file and so on.
> Yes, the default behavior is to have one mapper per input split, but
> again, this can be overridden by a custom InputFormat. For example, if
> you ask the InputFormat not to split a file, and the file is bigger
> than the block size, a single mapper will process all of its blocks.
>> Will this workflow benefit
>> from Hadoop data locality optimization?
> I did not understand this question.

Sorry, I was not clear enough. Let's say that I have 1 file stored in
HDFS and that it is split into 3 HDFS blocks: blockA, blockB and
blockC, which reside on machineA, machineB and machineC respectively.

On the other hand, let's say that this file is the input file and that
it is split by HDFS block, so I will have one mapper per input split
(3 mappers as a consequence: mapperA, mapperB and mapperC).
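
To make the scenario concrete, here is a minimal Python sketch of the layout I have in mind. It is only a model, not Hadoop code: the 64 MB block size, the file length and the function name block_splits are assumptions chosen for illustration, and it ignores any refinements a real InputFormat applies when computing splits.

```python
# Hypothetical model of the scenario; names and sizes are illustrative only.
BLOCK_SIZE = 64 * 1024 * 1024  # assumed HDFS block size (64 MB)


def block_splits(file_length, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block-sized chunk of the file."""
    splits = []
    offset = 0
    while offset < file_length:
        length = min(block_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


# A file slightly smaller than 3 full blocks still occupies 3 blocks.
file_length = 3 * BLOCK_SIZE - 1024
splits = block_splits(file_length)

# Assumed placement: one block per machine, as in the example above.
hosts = ["machineA", "machineB", "machineC"]
for (offset, length), host in zip(splits, hosts):
    print(f"split at offset {offset}, {length} bytes, block stored on {host}")
```

With one mapper per split, this gives the 3 mappers described above.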

If I understand correctly, it is expected that mapperA will be executed
on machineA, mapperB on machineB, and so on, right? If that is the
case, that is what I meant by data locality optimization: each mapper
is executed on the machine where its data resides, which optimizes the
workflow execution and reduces the traffic inside the cluster.
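
As a rough illustration of what I mean by locality, the sketch below prefers a free node that stores the block and falls back to any free node otherwise. This is my simplified reading of the idea, not Hadoop's actual scheduler; pick_node and the machine names are hypothetical.

```python
def pick_node(block_hosts, free_nodes):
    """Prefer a free node that stores the block; otherwise take any free node.

    Returns (node, is_local). Simplified model, not Hadoop's scheduler.
    """
    for node in free_nodes:
        if node in block_hosts:
            return node, True
    return free_nodes[0], False


# mapperA reads blockA, which is stored on machineA.
print(pick_node(["machineA"], ["machineC", "machineA"]))
# If no node holding the block is free, the data must cross the network.
print(pick_node(["machineB"], ["machineD"]))
```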

Now, suppose this input file is not split by HDFS block but handled as
one split per file. I will then have only 1 mapper, since I have only
1 input split. Where does the computation of the mapper take place?
On machineA, machineB, machineC, or another machine in the cluster?
Or is it not possible to predict the behavior of the system?
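
To show why the placement matters in this second case, here is a small Python sketch under the same assumed layout (one block per machine, no replicas). The names block_locations and local_blocks are hypothetical. It does not say where Hadoop will run the mapper, only how many blocks would be local for each candidate machine.

```python
# Assumed layout from the example above: one block per machine, no replicas.
block_locations = {
    "blockA": "machineA",
    "blockB": "machineB",
    "blockC": "machineC",
}


def local_blocks(mapper_host, locations=block_locations):
    """List the blocks a single mapper on mapper_host could read locally."""
    return [block for block, host in locations.items() if host == mapper_host]


for host in ["machineA", "machineB", "machineC", "machineD"]:
    local = local_blocks(host)
    remote = len(block_locations) - len(local)
    print(f"{host}: {len(local)} local block(s), {remote} read over the network")
```

Wherever the single mapper runs, at most one of the three blocks is local under this layout, so at least two blocks have to be transferred.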

Thanks for the help,

> Thanks,
> -b
>>> I hope I understood your problem properly, and my suggestion is the
>>> kind you were looking for.
>> Thanks,
>> Hassen
>>> Bibek

