hadoop-user mailing list archives

From "Cogan, Peter (Peter)" <Peter.Co...@alcatel-lucent.com>
Subject understanding performance
Date Mon, 03 Dec 2012 15:27:35 GMT
Hi there,

I've been doing some performance testing with Hadoop and have been getting highly variable
results which I am trying to understand. I've been measuring how long it takes to run
a particular MR job, and am finding that the time taken varies by a factor of 2 when I repeat
the job. Note that the data, algorithm, cluster, etc. are completely the same (and I am the only
person on the cluster).

The way I do the test is with a simple shell script that just runs the job again and again.
I find the job can take as little as 5 minutes or as long as 10, with everything in between.
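The script is essentially the following (the jar and class names are just placeholders for my actual job):

    #!/bin/bash
    # Run the same MR job repeatedly and log the wall-clock time of each run.
    for i in $(seq 1 20); do
        start=$(date +%s)
        # fresh output dir each run, since Hadoop refuses to overwrite one
        hadoop jar myjob.jar com.example.MyJob /input /output_run_$i
        end=$(date +%s)
        echo "run $i: $((end - start)) s" >> timings.txt
    done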

I've examined the log output of two jobs, and I can see that the performance difference
comes from the map and shuffle phases. For a sample 'fast' job, the map tasks take on
average 2 mins 34 secs, whereas for a sample 'slow' job they take on average 4 mins
12 secs. Interestingly, if I then look at the counters for random map tasks (one each from the
fast and slow jobs), I find that all counters are pretty much equal, including CPU
time. This suggests that the slowdown comes from a bottleneck at disk I/O or the network.
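For reference, I'm pulling the per-job timing analysis and the counters with the standard job CLI (the output dir and job id below are placeholders):

    # Timing analysis for a finished job (avg/best/worst map, shuffle, reduce times):
    hadoop job -history /output_run_3

    # Completion percentages and all counters for a job, including CPU time:
    hadoop job -status job_201212030001_0001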
Since I am the only user on the network (it's a dedicated gigabit switch) and the only one
using the disks, I don't understand what can be happening. Also, the total data is not that
huge: the job analyses 21GB with replication 2, spread across 8 disks on 4 nodes. The total
disk output from the reducers is about 300MB. I'm not sure how to investigate further. Is there
some other diagnostic within Hadoop that can tell me where the code is waiting (e.g. for network
or disk I/O), or perhaps some system tool that can indicate performance hits in specific
places?
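For example, I suppose I could run something like the standard Linux tools below on each node while a job is going, but I'm not sure how to tie their output back to the map and shuffle phases:

    # Extended per-disk stats (utilisation, queue size, await), every 5s:
    iostat -x 5

    # Per-interface network throughput, every 5s:
    sar -n DEV 5

    # Run queue, swapping and iowait at a glance:
    vmstat 5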

Thanks for any suggestions

Peter


