hadoop-common-user mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: Performance question
Date Mon, 20 Apr 2009 15:24:46 GMT

On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote:

> Hi,
> I ran a Hadoop MapReduce task in the local mode, reading and writing from
> HDFS, and it took 2.5 minutes. Essentially the same operations on the local
> file system without MapReduce took 1/2 minute. Is this to be expected?

Hmm... some overhead is expected, but this seems like too much. Which
version of Hadoop are you running?

It's hard to help without more details about your application,
configuration, etc., but I'll try...

> It seemed that the system lost most of the time in the MapReduce operation,
> such as after these messages
> 09/04/19 23:23:01 INFO mapred.LocalJobRunner: reduce > reduce
> 09/04/19 23:23:01 INFO mapred.JobClient:  map 100% reduce 92%
> 09/04/19 23:23:04 INFO mapred.LocalJobRunner: reduce > reduce
> it waited for a long time. The final output lines were

It could be either the reduce-side merge or the HDFS write. Can you
check your task logs and datanode logs?

> 09/04/19 23:24:13 INFO mapred.JobClient:     Combine input records=185
> 09/04/19 23:24:13 INFO mapred.JobClient:     Combine output records=185

That shows that the combiner is useless for this app: it takes 185
records in and emits 185 records out, so it merges nothing. Turn it
off; it only adds unnecessary overhead.
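
As a rough way to make that judgement repeatable, here is a minimal sketch that computes the fraction of records a combiner removed from the two counters quoted above. The class and method names are made up for illustration and are not part of Hadoop.

```java
// Hypothetical helper (not a Hadoop API): judges a combiner from its counters.
final class CombinerCheck {
    /** Fraction of records the combiner removed; 0.0 means it removed nothing. */
    static double reduction(long combineInputRecords, long combineOutputRecords) {
        if (combineInputRecords <= 0) {
            return 0.0; // the combiner never ran, so nothing was saved
        }
        return 1.0 - (double) combineOutputRecords / combineInputRecords;
    }

    public static void main(String[] args) {
        // 'Combine input records' and 'Combine output records' from the job output.
        long in = 185;
        long out = 185;
        System.out.println("combiner removed a fraction of " + reduction(in, out));
    }
}
```

A result at or near 0.0, as with these counters, suggests the combiner is pure overhead for the job.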

> 09/04/19 23:24:13 INFO mapred.JobClient:   File Systems
> 09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes read=138103444
> 09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes written=107357785
> 09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes read=282509133
> 09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes written=376697552

For the amount of data you are processing, you are doing far too much
local-disk I/O.
'Local bytes written' should be _very_ close to the 'Map output bytes',
i.e. about 91M for the maps and zero for the reduces. Your job wrote
376697552 bytes locally, roughly four times that. (See the counters
table on the job-details web UI.)
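
To put a number on "too much", a small sketch like the following compares 'Local bytes written' against 'Map output bytes'. It assumes "91M" means 91 * 10^6 bytes, since the exact counter value is not quoted in this thread; the class and method names are hypothetical.

```java
// Hypothetical helper (not a Hadoop API): compares local writes to map output.
final class LocalIoCheck {
    /** How many times larger the local-disk writes are than the map output. */
    static double writeAmplification(long localBytesWritten, long mapOutputBytes) {
        if (mapOutputBytes <= 0) {
            throw new IllegalArgumentException("mapOutputBytes must be positive");
        }
        return (double) localBytesWritten / mapOutputBytes;
    }

    public static void main(String[] args) {
        long localBytesWritten = 376697552L; // 'Local bytes written' from the job output
        long mapOutputBytes = 91_000_000L;   // ASSUMPTION: "91M" read as 91 * 10^6 bytes
        System.out.println("local writes / map output = "
                + writeAmplification(localBytesWritten, mapOutputBytes));
    }
}
```

A ratio close to 1 is the target described above; anything well above that points at extra spills or on-disk merges.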

There are a few knobs you need to tweak to get closer to optimal
performance. The good news is that it's doable; the bad news is that
you _have_ to get your fingers dirty...

Some knobs you will be interested in are:


* mapred.reduce.parallel.copies
* mapred.reduce.copy.backoff
* mapred.job.shuffle.input.buffer.percent
* mapred.job.shuffle.merge.percent
* mapred.inmem.merge.threshold
* mapred.job.reduce.input.buffer.percent

Check the description of each of them in hadoop-default.xml or
mapred-default.xml (depending on the version of Hadoop you are running).
Some more details are available here: http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/TuningAndDebuggingMapReduce_ApacheConEU09.pdf
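
For experimenting with the knobs listed above, one option is to keep the overrides in one place and render them as a `<configuration>` block for a site or job configuration file. The sketch below does only that. The values are placeholders I picked to resemble common defaults, not recommendations, so check them against the defaults file for your version; the class and method names are hypothetical.

```java
// Hypothetical helper (not a Hadoop API): renders overrides as Hadoop-style XML.
final class ShuffleKnobs {
    /** Renders name/value pairs as a <configuration> block of <property> entries. */
    static String toConfigurationXml(java.util.Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (java.util.Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(e.getKey()).append("</name>\n")
              .append("    <value>").append(e.getValue()).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    /** ASSUMPTION: placeholder values only; tune them for your own job. */
    static java.util.Map<String, String> placeholders() {
        java.util.Map<String, String> p = new java.util.LinkedHashMap<>();
        p.put("mapred.reduce.parallel.copies", "5");
        p.put("mapred.reduce.copy.backoff", "300");
        p.put("mapred.job.shuffle.input.buffer.percent", "0.70");
        p.put("mapred.job.shuffle.merge.percent", "0.66");
        p.put("mapred.inmem.merge.threshold", "1000");
        p.put("mapred.job.reduce.input.buffer.percent", "0.0");
        return p;
    }

    public static void main(String[] args) {
        System.out.print(toConfigurationXml(placeholders()));
    }
}
```

Changing one value at a time and re-checking the job counters makes it easier to see which knob actually reduces the local-disk I/O.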

