hadoop-common-user mailing list archives

From Thorsten Schuett <schu...@zib.de>
Subject Re: Reduce Performance
Date Thu, 23 Aug 2007 08:54:30 GMT
On Wednesday 22 August 2007, Doug Cutting wrote:
> Thorsten Schuett wrote:
> > In my case, it looks as if the loopback device is the bottleneck. So
> > increasing the number of tasks won't help.
> Hmm.  I have trouble believing that the loopback device is actually the
> bottleneck.  What makes you think that it is?
During the copy phase of reduce, the CPU load was very low, and vmstat showed
constant reads from the disk at ~15 MB/s and bursty writes. At the same time,
data was sent over the loopback device at ~15 MB/s. I don't see what else
could limit the performance here. The disk can certainly provide the data at
higher speeds.
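
One way to test the loopback hypothesis independently of Hadoop would be to
measure raw TCP throughput over 127.0.0.1 and compare it with the ~15 MB/s
seen during the copy phase. The following is only a minimal sketch and not
part of Hadoop: the function name measure_loopback_throughput, the 64 MB
transfer size, and the 64 KB chunk size are all assumptions chosen for
illustration.

```python
import socket
import threading
import time


def measure_loopback_throughput(total_bytes=64 * 1024 * 1024, chunk_size=64 * 1024):
    """Send total_bytes over a TCP connection on 127.0.0.1.

    Returns (bytes_received, throughput_in_MB_per_s).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = [0]  # mutable holder so the receiver thread can update it

    def receiver():
        # Accept one connection and drain it until the sender closes.
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk_size)
                if not data:
                    break
                received[0] += len(data)

    thread = threading.Thread(target=receiver)
    thread.start()

    payload = b"\0" * chunk_size
    start = time.perf_counter()
    with socket.create_connection(("127.0.0.1", port)) as client:
        sent = 0
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    thread.join()  # wait until the receiver has read everything
    elapsed = time.perf_counter() - start
    server.close()

    return received[0], received[0] / elapsed / 1e6


if __name__ == "__main__":
    nbytes, rate = measure_loopback_throughput()
    print(f"received {nbytes} bytes at {rate:.1f} MB/s over loopback")
```

If the rate reported by such a test is far above 15 MB/s, the loopback
device itself is unlikely to be the limit, and the cause would have to be
sought elsewhere in the copy path.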

I'll be happy to repeat my experiments with the MiniMR code, but I need a
pointer on how to proceed and where to start.


> To better support standalone use of Hadoop on multicore boxes, perhaps
> we should promote the MiniMR cluster code from test into the core.  This
> runs the tasktracker and jobtracker in the same process.  It still forks
> processes for tasks, and has all the features of a grid setup: web ui,
> task restarting, etc.
>
> I don't think we should spend much effort adding multi-threading to
> LocalRunner, since it lacks so many of the other features of
> TaskTracker/JobTracker.  We should also avoid re-implementing those
> features.  Thus running TaskTracker and JobTracker in the same JVM seems
> like a good strategy for multicore support.
>
> If performance with a MiniMR cluster is not good, then we should
> determine why.  We could, e.g., benchmark and profile sort performance
> in this configuration.  Again, I have a hard time believing that
> loopback bandwidth is a bottleneck.  If it is, then perhaps we can
> optimize around it, but let's first be sure that's the case.
>
> Note that, when running standalone, even with TaskTracker and
> JobTracker, one need not use HDFS.  Direct access to the local
> filesystem will probably be considerably faster.
>
> Doug
