I am running a job that takes no input through the mapper's input key/value interface. Each mapper
reads the same small file from the distributed cache and processes it independently (to generate
a Monte Carlo sampling of the problem space). I am using MR purely to parallelize sampling runs
that are otherwise repetitive and independent of one another. To maximize parallelism, I want to set the number
of mappers explicitly, so that 10 samples run in exactly 1X time by being distributed evenly
over 10 mappers. I accomplish this by generating a dummy MR input file of placeholder
data. Every row is identical, so I know the exact length of each row. I then simply set
the split size to the row length, expecting Hadoop to assign one row to each mapper and so create
exactly the intended number of mappers. This approach mostly works. However, I get a few extraneous empty mappers.
Since they get no input, they do no work and exit almost immediately, so they aren't a serious
drain on cluster resources, but I'm confused why I get extra mappers in the first place.
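For concreteness, the dummy input can be sketched as below. This is a minimal standalone illustration, not my actual job code: the class name DummyInput, the 20-byte filler row, and the count of 10 rows are made up for the example, and it assumes single-byte ASCII characters with a single '\n' endline per row. It does not call Hadoop at all.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DummyInput {
    // Build numRows identical rows, each terminated by a single '\n'.
    static String build(String row, int numRows) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < numRows; i++) {
            sb.append(row).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical filler row: 20 identical placeholder characters.
        char[] filler = new char[20];
        Arrays.fill(filler, 'x');
        String row = new String(filler);
        int numRows = 10; // one row per intended mapper

        byte[] file = build(row, numRows).getBytes(StandardCharsets.US_ASCII);
        int rowLength = row.getBytes(StandardCharsets.US_ASCII).length;

        System.out.println("row length without endline: " + rowLength);
        System.out.println("row length with endline:    " + (rowLength + 1));
        System.out.println("file size in bytes:         " + file.length);
    }
}
```

The first two printed values are the two candidate split sizes I ask about in this post.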
My working theory was that the endlines of the input file must be accounted for when calculating
split sizes (so my splits were too small and I got a few extra splits hanging off the end
of the input file). I attempted to fix this by adding one to the calculated split size (so it is
now one greater than the actual row length). This yields exactly the intended
number of mappers, the same number as there are rows in the input file. However,
the work distribution is not even. Almost every run produces one mapper that
receives no input (and ends immediately) and another mapper that receives two inputs, thus
triggering two "processing sessions" on that mapper so that it takes twice as
long to complete as the other mappers. Obviously, this wrecks the potential parallelism by
doubling the overall job time.
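Here is the arithmetic behind my theory as a minimal sketch. It is a simplified model, not Hadoop's actual splitting code: I assume the file is cut into consecutive byte ranges of the configured split size, and that a row is handed to whichever split contains the row's first byte. The class and method names (SplitModel, countSplits, rowsPerSplit) and the numbers (10 rows of 20 bytes plus one endline byte) are made up for illustration.

```java
import java.util.Arrays;

class SplitModel {
    // Number of consecutive byte ranges of size splitSize needed to cover the file.
    static int countSplits(long fileSize, long splitSize) {
        return (int) ((fileSize + splitSize - 1) / splitSize);
    }

    // Rows received by each split, ASSUMING a row belongs to the split
    // that contains the row's first byte (an assumption, not verified Hadoop behavior).
    static int[] rowsPerSplit(int numRows, long bytesPerRow, long splitSize) {
        long fileSize = numRows * bytesPerRow;
        int[] counts = new int[countSplits(fileSize, splitSize)];
        for (int r = 0; r < numRows; r++) {
            long rowStart = r * bytesPerRow;
            counts[(int) (rowStart / splitSize)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int numRows = 10;
        long rowLength = 20;              // row length without endline (hypothetical)
        long bytesPerRow = rowLength + 1; // each row occupies one extra endline byte on disk

        // Split size = row length without endline.
        System.out.println(Arrays.toString(rowsPerSplit(numRows, bytesPerRow, rowLength)));
        // Split size = row length with endline.
        System.out.println(Arrays.toString(rowsPerSplit(numRows, bytesPerRow, rowLength + 1)));
    }
}
```

Under this model, a split size of 20 gives 11 splits with the last one empty, and a split size of 21 gives 10 splits with exactly one row each. The first result matches the extra empty mappers I saw, but the second does not match the uneven distribution I actually observe, so the model must be missing something about how rows on split boundaries are assigned.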
Which split size is correct: row length without endline or row length with endline? The
former yields extra empty mappers while the latter yields exactly the right number. However,
if the latter is correct, why is the task distribution uneven (albeit NEARLY even) and what
(if anything) can be done about it?
Thanks.
________________________________________________________________________________
Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com
"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
 Mark Twain
________________________________________________________________________________
