hadoop-common-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Optimization of cpu and i/o usage / other bottlenecks?
Date Thu, 15 Oct 2009 19:57:25 GMT
What Hadoop version?

On a cluster this size there are two things to check right away:

1.  In the Hadoop UI, during the job, are the map and reduce slots close to
full most of the time, or are tasks completing faster than the scheduler can
keep up, so that many slots often sit empty?

For 0.19.x and 0.20.x on a small cluster like this, use the Fair Scheduler
and make sure the configuration parameter that allows it to schedule more
than one task per heartbeat is on (at least one map and one reduce per
heartbeat, which is supported in 0.19.x).  This alone will cut times down if
the number of map and reduce tasks is at least 2x the number of nodes.
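If it helps, the relevant mapred-site.xml entries look roughly like this -- a
sketch assuming Hadoop 0.20 with the Fair Scheduler jar on the classpath
(verify the property names against your version's fair scheduler docs):

```xml
<!-- mapred-site.xml: enable the Fair Scheduler and let it assign
     more than one task (a map and a reduce) per TaskTracker heartbeat. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>
```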

2. Your CPU, Disk and Network aren't saturated -- take a look at the logs of
the reduce tasks and look for long delays in the shuffle.
Utilization is throttled by a bug in the reducer shuffle phase, not fixed
until 0.21.  Simply put, a single reduce task won't fetch more than one map
output from another node every 2 seconds (though it can fetch from multiple
nodes at once).  Fix this by commenting out one line in 0.18.x, 0.19.x or
0.20.x -- see my comment here:
From June 10, 2009.

I saw shuffle times on small clusters with a large map-task-per-node ratio
decrease by a factor of 30 from that one-line fix.  It was the only way to
get the network anywhere close to saturation on any node.
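A quick back-of-envelope sketch shows why that throttle dominates shuffle
time on a small cluster with many maps per node.  The numbers below are
illustrative assumptions, not measurements from this thread -- only the
2-second penalty comes from the description above:

```python
# Rough lower bound on shuffle time imposed by a per-host fetch throttle:
# a reduce task fetches at most one map output from a given host every
# PENALTY_S seconds, though it can fetch from all hosts in parallel.

PENALTY_S = 2.0      # minimum delay between fetches from the same host
NODES = 10           # slave nodes (illustrative)
MAPS_PER_NODE = 60   # map outputs each reducer must pull per node (illustrative)

def shuffle_lower_bound(nodes, maps_per_node, penalty_s):
    # Fetches from different hosts overlap, so the bound is set by any
    # single host's queue: (maps_per_node - 1) forced waits of penalty_s.
    return (maps_per_node - 1) * penalty_s

bound = shuffle_lower_bound(NODES, MAPS_PER_NODE, PENALTY_S)
print(f"shuffle lower bound: {bound:.0f}s (~{bound / 60:.0f} min)")
# prints: shuffle lower bound: 118s (~2 min)
```

Two minutes of forced idle waiting per reducer, regardless of network or CPU
capacity, which is why removing the penalty mattered so much.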

The delays for low-latency jobs on smaller clusters are predominantly
artificial: most RPC is ping-response in nature, and most design and testing
was done for large clusters of machines that run only a couple of maps or
reduces per TaskTracker.

Do the above, and for low-latency jobs you won't be nearly as sensitive to
the size of data per task as out-of-the-box Hadoop is.  Your overall
utilization will go up quite a bit.


On 10/14/09 7:31 AM, "Chris Seline" <chris@searchles.com> wrote:

> No, there doesn't seem to be all that much network traffic. Most of the
> time, traffic (measured with nethogs) is about 15-30 KB/s on the master and
> slaves during map; sometimes it bursts up to 5-10 MB/s on a slave for maybe
> 5-10 seconds on a query that takes 10 minutes, but that is still less
> than what I see in scp transfers on EC2, typically about 30 MB/s.
> thanks
> Chris
> Jason Venner wrote:
>> are your network interface or the namenode/jobtracker/datanodes saturated
>> On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline <chris@searchles.com> wrote:
>>> I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11
>>> c1.xlarge instances (1 master, 10 slaves), that is the biggest instance
>>> available with 20 compute units and 4x 400gb disks.
>>> I wrote some scripts to test many (100's) of configurations running a
>>> particular Hive query to try to make it as fast as possible, but no matter
>>> what I don't seem to be able to get above roughly 45% cpu utilization on the
>>> slaves, and not more than about 1.5% wait state. I have also measured
>>> network traffic and there don't seem to be bottlenecks there at all.
>>> Here are some typical CPU utilization lines from top on a slave when
>>> running a query:
>>> Cpu(s): 33.9%us,  7.4%sy,  0.0%ni, 56.8%id,  0.6%wa,  0.0%hi,  0.5%si,
>>>  0.7%st
>>> Cpu(s): 33.6%us,  5.9%sy,  0.0%ni, 58.7%id,  0.9%wa,  0.0%hi,  0.4%si,
>>>  0.5%st
>>> Cpu(s): 33.9%us,  7.2%sy,  0.0%ni, 56.8%id,  0.5%wa,  0.0%hi,  0.6%si,
>>>  1.0%st
>>> Cpu(s): 38.6%us,  8.7%sy,  0.0%ni, 50.8%id,  0.5%wa,  0.0%hi,  0.7%si,
>>>  0.7%st
>>> Cpu(s): 36.8%us,  7.4%sy,  0.0%ni, 53.6%id,  0.4%wa,  0.0%hi,  0.5%si,
>>>  1.3%st
>>> It seems like if tuned properly, I should be able to max out my cpu (or my
>>> disk) and get roughly twice the performance I am seeing now. None of the
>>> parameters I am tuning seem to be able to achieve this. Adjusting
>>> mapred.map.tasks and mapred.reduce.tasks does help somewhat, and setting
>>> io.file.buffer.size to 4096 does better than the default, but the rest of
>>> the values I am testing seem to have little positive effect.
>>> These are the parameters I am testing, and the values tried:
>>> io.sort.factor=2,3,4,5,10,15,20,25,30,50,100
>>> mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
>>> io.bytes.per.checksum=256,512,1024,2048,4192
>>> mapred.output.compress=true,false
>>> hive.exec.compress.intermediate=true,false
>>> hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
>>> mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200
>>> mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m
>>> mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200
>>> mapred.merge.recordsBeforeProgress=5000,10000,20000,30000
>>> mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
>>> io.sort.spill.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
>>> mapred.job.tracker.handler.count=3,4,5,7,10,15,25
>>> hive.merge.size.per.task=64000000,128000000,168000000,256000000,300000000,400000000
>>> hive.optimize.ppd=true,false
>>> hive.merge.mapredfiles=false,true
>>> io.sort.record.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
>>> hive.map.aggr.hash.percentmemory=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
>>> mapred.tasktracker.reduce.tasks.maximum=1,2,3,4,5,6,8,10,12,15,20,30
>>> mapred.reduce.parallel.copies=1,2,4,6,8,10,13,16,20,25,30,50
>>> io.seqfile.lazydecompress=true,false
>>> io.sort.mb=20,50,75,100,150,200,250,350,500
>>> mapred.compress.map.output=true,false
>>> io.file.buffer.size=1024,2048,4096,8192,16384,32768,65536,131072,262144
>>> hive.exec.reducers.bytes.per.reducer=1000000000
>>> dfs.datanode.handler.count=1,2,3,4,5,6,8,10,15
>>> mapred.tasktracker.map.tasks.maximum=5,8,12,20
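A sweep like the one above can be driven by a small script. A minimal sketch
-- the query file name and the use of hive's -hiveconf/-f flags here are
assumptions, adapt to however the tests are actually launched:

```python
# Sketch of a one-parameter-at-a-time sweep: run the same Hive query
# with each candidate value via -hiveconf and time each run.
import subprocess
import time

# Small illustrative grid; extend with the full lists above.
PARAMS = {
    "mapred.map.tasks": [10, 20, 50, 100],
    "io.file.buffer.size": [4096, 65536],
}

def build_cmd(param, value, query_file="query.hql"):
    # hive accepts per-job config overrides as -hiveconf key=value
    return ["hive", "-hiveconf", f"{param}={value}", "-f", query_file]

def sweep(params, run=subprocess.run):
    """Run the query once per (param, value) pair; return wall-clock timings."""
    timings = {}
    for param, values in params.items():
        for value in values:
            start = time.time()
            run(build_cmd(param, value), check=True)
            timings[(param, value)] = time.time() - start
    return timings

# Dry run: show the commands that would be executed.
for param, values in PARAMS.items():
    for value in values:
        print(" ".join(build_cmd(param, value)))
```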
>>> Anyone have any thoughts for other parameters I might try? Am I going about
>>> this the wrong way? Am I missing some other bottleneck?
>>> thanks
>>> Chris Seline
