hadoop-common-user mailing list archives

From Allen Wittenauer <awittena...@linkedin.com>
Subject Re: How to control the number of map tasks for each nodes?
Date Thu, 22 Jul 2010 17:23:21 GMT

On Jul 22, 2010, at 4:40 AM, Vitaliy Semochkin wrote:
> If it was a context switching would the increasing number of
> mappers/reducers lead to performance improvement?

Whoops, I misspoke.  I meant process switching (which I guess is a form of context switching).
More on that later.

> I have one log file ~140GB I use default hdfs block size (64mb)

So that's not 'huge' by Hadoop standards at all. 

If I did my math correctly and it really is one big file, you're likely seeing somewhere on
the order of 2300 or so maps to process that file, right?  What is the average time per task?
What scheduler is being used?  What version of Hadoop?  How many machines?
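The map-count estimate falls straight out of arithmetic, assuming the default of one map task per 64 MB block:

```shell
# 140 GB of input split into 64 MB blocks, one map per block:
# 140 * 1024 / 64 = 2240 maps -- in line with the ~2300 figure above.
echo $((140 * 1024 / 64))
```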

> also I set dfs.replication=1

Eek.  I wonder what your locality hit rate is.

> Am I right that the higher dfs.replication the faster map reduce will work
> because the probability that split will be on a local node will be equal to
> 1?

If you mean the block for the map input, yes, you have a much higher probability of it being
local and therefore faster.
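If you want to experiment, you don't have to re-import the data: the replication factor can be changed per file after the fact, or set at write time.  A sketch (the path /logs/access.log is hypothetical):

```shell
# Raise the replication factor of an existing file to 3; -w waits until
# re-replication actually completes before returning.
hadoop fs -setrep -w 3 /logs/access.log

# Or set the replication factor at write time for a new upload:
hadoop fs -D dfs.replication=3 -put access.log /logs/access.log
```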

> Also, is it correct that it will slow down put operations? (technically put
> operations will run in parallel so I'm not sure if it will slow down
> performance or not)

I don't know if anyone has studied the effect of the output replication factor on job
performance.  I'm sure someone has, though.  I'm not fully awake yet (despite it being 10am),
but I'm fairly certain that the job of replicating the local block falls onto the DN, not the
client, so the client may not be held up by replication at all.

>> More memory means that Hadoop doesn't have to spill to disk as often due to
>> being able to use a larger buffer in RAM.
> Does hadoop  check if it has enough memory for such operation?

That depends upon what you mean by 'check'.  By default, Hadoop will spawn whatever size heap
you want.  The OS, however, may have different ideas as to what is allowable. :)
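In other words, the heap is whatever you request for the child JVMs, and it's up to the OS whether that actually fits.  A sketch of requesting a larger per-task heap at job-submission time (the jar and class names are hypothetical):

```shell
# Ask for a 512 MB heap per task JVM.  Hadoop will happily spawn this
# even if node memory is oversubscribed -- the OS is the real limit.
hadoop jar myjob.jar MyJob -D mapred.child.java.opts=-Xmx512m input output
```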


What I suspect is really happening is that your tasks are not very CPU intensive and don't
take long to run through 64mb of data.  So your task turn around time is very very fast. 
So fast, in fact, that the scheduler can't keep up.  Boosting the number of tasks per node
actually helps because the *initial* scheduling puts you that much farther ahead.  

An interesting test to perform is to bump the block size up to 128mb.  You should see fewer
tasks that stick around a bit longer.  But you could very well see overall throughput go up
because you are spending less time cycling through JVMs.  [Even with reuse turned on.]
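One way to try that without touching cluster-wide defaults is to re-upload the file with a larger per-file block size (path and filename are hypothetical):

```shell
# 128 MB in bytes:
echo $((128 * 1024 * 1024))   # 134217728

# Re-upload with a 128 MB block size; the setting applies per file at
# write time, so existing files keep their old block size.
hadoop fs -D dfs.block.size=134217728 -put access.log /logs/access.log
```

With ~140 GB of input, this roughly halves the task count, so each map sticks around about twice as long.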
