hadoop-common-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject Re: Multiple various streaming questions
Date Mon, 07 Feb 2011 23:39:14 GMT
On Feb 7, 2011, at 13:39 , Allen Wittenauer wrote:

> On Feb 4, 2011, at 7:46 AM, Keith Wiley wrote:

[On the topic of why I care if Hadoop funnels and queues multiple input splits into a small
number of mappers instead of perfectly parallelizing the job across the available slots...]

>> Because all slots are not in use.  It's a very large cluster and it's excruciating
>> that Hadoop partially serializes a job by piling multiple map tasks onto a single mapper in
>> a queue even when the cluster is massively underutilized.
> 	Well, sort of.
> 	The only input hadoop has to go on is your filename input, which is relatively tiny.
> So of course it is going to underutilize.  This makes sense now. :)

I think we're talking around each other a little bit here.  I'm sorry.  In my original
description, I was referring to the nonstreaming version of my program.  The all-Java version
doesn't use filenames; it sets up actual Hadoop input splits from files stored on HDFS.  These
files are about 6MB each after decompression.  My point, earlier in this thread, was that Hadoop's
default behavior, even in that case, which used the actual "largish" files as the inputs, still
assigned many input splits to a single mapper (since they are smaller than a block) instead of
achieving perfect parallelism.

The degree of queueing seemed perfectly coordinated with the block size of 64MB.  That is
to say, given my input files of about 6MB each, Hadoop would assign about ten of them per
mapper...where I wanted one per mapper and ten times as many mappers.
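(Just to spell out the arithmetic as I understand it: the sketch below is my reading of the
new-API FileInputFormat split-size formula, splitSize = max(minSize, min(maxSize, blockSize)).
It's only a sketch, not a claim about the scheduler; the 64MB and 6MB figures are the ones
from this thread, and the property defaults are my assumptions.)

    // Sketch only: split-size arithmetic as I understand the 0.20-era new-API
    // FileInputFormat to compute it.
    public class SplitMath {
        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;   // 64MB HDFS block size
            long minSize   = 1L;                  // mapred.min.split.size default
            long maxSize   = Long.MAX_VALUE;      // mapred.max.split.size left unset

            long splitSize = Math.max(minSize, Math.min(maxSize, blockSize)); // = 64MB

            long fileSize = 6L * 1024 * 1024;     // one decompressed input, ~6MB
            // A 64MB split budget is roughly ten times a 6MB input, which lines up
            // with the "about ten per mapper" grouping described above.
            System.out.println(splitSize / fileSize); // prints 10
        }
    }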

Then, my final point was that in the nonstreaming, all-Java case, I could *NOT* achieve the
desired behavior simply by setting mapred.map.tasks to a high number, say, one per input file
(I honestly don't remember what the behavior was when I tried this; it was a very long time
ago).  It simply did not work; Hadoop ignored it and queued up all my inputs anyway.  What
I had to do was set mapred.max.split.size really small so that Hadoop would not be willing
to queue multiple inputs up on a single mapper.  Ideally, I would set mapred.max.split.size
slightly larger than a single input file (which is about 6MB).  Doing this achieves my desired
goal: one input per mapper, perfect parallelism, and minimum job turn-around time.
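(For concreteness, this is roughly what I mean.  The property name is the one I used on the
0.20-era API, and the 8MB cap is just an illustrative value slightly above my ~6MB inputs,
not something lifted from my actual job.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class OneInputPerMapper {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cap each split just above one ~6MB input so no mapper is handed
            // several files' worth of input; the 8MB figure is illustrative.
            conf.setLong("mapred.max.split.size", 8L * 1024 * 1024);

            Job job = new Job(conf, "one-input-per-mapper");
            // ... set mapper class, input/output paths, etc. as usual ...

            // The streaming equivalent would be a -D generic option, e.g.:
            //   hadoop jar hadoop-streaming.jar -D mapred.max.split.size=8388608 ...
        }
    }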

Now, all that said, I am perfectly open to discussion or suggestions as to how I ought to
better handle this situation, including the notion that mapred.map.tasks should have worked
the way I intended in the first place (Did I just do something wrong there?  Should it have
worked the way I expected it to?).  At any rate, what is the proper Hadoop method for evenly
distributing inputs across all nodes before doubling up on any given node?

Sorry, maybe this thread is getting a bit rambling.  We can drop it if people prefer.....


Keith Wiley               kwiley@keithwiley.com               www.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
  -- Galileo Galilei
