hadoop-hdfs-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: Is perfect control over mapper num AND split distribution possible?
Date Tue, 21 Jan 2014 19:43:48 GMT
Can you not use Hadoop's NLineInputFormat?
If you generate a text file of 100 lines, then by default each line will trigger one mapper task.
As long as you have 100 task slots available, you will get 100 mappers running concurrently.
If you want exact control over the number of mappers, NLineInputFormat is designed for that purpose.
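
To illustrate why byte-sized splits behave the way the quoted message describes, while line-count splits do not, here is a rough plain-Java model. It is only a sketch and does not call Hadoop. The class and method names (SplitModel, byteSplits, linesPerSplit, lineCountSplits) are made up for illustration. It assumes two things that I believe hold for Hadoop 2.x but that may differ in other versions: that file splitting allows the final split to run up to 10% over the split size, and that the text line reader skips the first line of every split except the first and then reads one line past its split end. The example uses 10 rows of 20 characters plus a one-byte newline each.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitModel {

    // Byte ranges [start, end) for a file, cut at a fixed split size.
    // ASSUMPTION: the last split may be up to 1.1x the split size (slop).
    static List<long[]> byteSplits(long fileLen, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLen;
        while ((double) remaining / splitSize > 1.1) {
            long start = fileLen - remaining;
            splits.add(new long[] {start, start + splitSize});
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits.add(new long[] {fileLen - remaining, fileLen});
        }
        return splits;
    }

    // Number of lines each byte split would read.
    // ASSUMPTION: a split owns the line at offset 0 if it starts at 0,
    // plus every line whose start offset p satisfies start < p <= end.
    static int[] linesPerSplit(List<long[]> splits, int rows, int rowLen) {
        int recordLen = rowLen + 1; // row plus one newline byte
        int[] counts = new int[splits.size()];
        for (int i = 0; i < splits.size(); i++) {
            long start = splits.get(i)[0];
            long end = splits.get(i)[1];
            for (int r = 0; r < rows; r++) {
                long p = (long) r * recordLen;
                boolean owned = (start == 0 && p == 0) || (p > start && p <= end);
                if (owned) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    // Line-count splitting: every split gets a fixed number of whole lines,
    // which is the idea behind NLineInputFormat.
    static int[] lineCountSplits(int rows, int linesPerSplit) {
        int n = (rows + linesPerSplit - 1) / linesPerSplit;
        int[] counts = new int[n];
        for (int i = 0; i < n; i++) {
            counts[i] = Math.min(linesPerSplit, rows - i * linesPerSplit);
        }
        return counts;
    }

    public static void main(String[] args) {
        int rows = 10, rowLen = 20;
        long fileLen = (long) rows * (rowLen + 1);

        // Split size = row length without newline: 11 splits, the last one empty.
        System.out.println(Arrays.toString(
            linesPerSplit(byteSplits(fileLen, rowLen), rows, rowLen)));
        // prints [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

        // Split size = row length with newline: 10 splits, one doubled, one empty.
        System.out.println(Arrays.toString(
            linesPerSplit(byteSplits(fileLen, rowLen + 1), rows, rowLen)));
        // prints [2, 1, 1, 1, 1, 1, 1, 1, 1, 0]

        // One line per split: 10 splits, one line each.
        System.out.println(Arrays.toString(lineCountSplits(rows, 1)));
        // prints [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    }
}
```

Under these assumptions, a split size equal to the row length plus newline makes every split boundary coincide with a line start, so the first split also claims the line that begins exactly at its end and the last split is left with nothing. Splitting by line count avoids the byte arithmetic entirely. In a real job that would mean setting NLineInputFormat as the input format class and, if more than one line per mapper is wanted, calling NLineInputFormat.setNumLinesPerSplit(job, n).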

> From: kwiley@keithwiley.com
> Subject: Is perfect control over mapper num AND split distribution possible?
> Date: Tue, 21 Jan 2014 11:28:32 -0800
> To: user@hadoop.apache.org
> I am running a job that takes no input from the mapper-input key/value interface.  Each
job reads the same small file from the distributed cache and processes it independently (to
generate Monte Carlo sampling of the problem space).  I am using MR purely to parallelize
the otherwise redundant and separated sampling process.  To maximize parallelism, I want to
set the number of mappers explicitly, such that 10 samples run in exactly 1X time by perfectly
distributing over 10 mappers.  I am accomplishing this by generating a dummy MR input file
of nonvalue data.  Each row is identical so I know the exact row length of all rows.  I then
simply set the split size to the row length with the intention that Hadoop will assign exactly
the intended number of mappers.  This approach mostly works.  However, I get a few extraneous
empty mappers.  Since they get no input, they do no work and exit almost immediately, so they
aren't a serious drain on cluster resources, but I'm confused why I get extra mappers in the
first place.
> My working theory was that the end-lines of the input file must be accounted for when
calculating split sizes (so my splits were too small and I got a few extra splits hanging
off the end of the input file).  I attempted to fix this by adding one to the calculated split
size (one greater than the actual row length now).  This works perfectly, generating exactly
the intended number of mappers, exactly the same number as there are rows in the input file.
 However, the labor distribution is not perfect.  Almost every single run produces one mapper
which receives no input (and ends immediately) and another mapper which receives two inputs,
thus triggering two "processing sessions" on that particular mapper such that it takes twice
as long to complete as the other mappers.  Obviously, this wrecks the potential parallelism
by literally doubling the overall job time.
> Which split size is correct: row length without end-line or row length with end-line?
 The former yields extra empty mappers while the latter yields exactly the right number. 
However, if the latter is correct, why is the task distribution uneven (albeit NEARLY even)
and what (if anything) can be done about it?
> Thanks.
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> "The easy confidence with which I know another man's religion is folly teaches
> me to suspect that my own is also."
>                                            --  Mark Twain
> ________________________________________________________________________________