hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Benchmarking performance in Amazon EC2/EMR environment
Date Tue, 01 Feb 2011 15:48:17 GMT
On 31/01/11 23:22, Aaron Eng wrote:
> Hi all,
> I was wondering if any of you have had a similar experience working with
> Hadoop in Amazon's environment.  I've been running a few jobs over the last
> few months and have noticed them taking more and more time.  For instance, I
> was running teragen/terasort/teravalidate as a benchmark and I've noticed
> the average execution times of all three jobs have increased by 25-33% this
> month vs. what I was seeing in December.  When I was able to quantify this I
> started collecting some disk I/O stats using sar and dd.  I found that on any
> given node in an EMR cluster, the throughput to the ephemeral storage ranged
> from <30MB/s to >400MB/s.  I also noticed that when using EBS volumes, the
> throughput would range from ~20MB/s up to 100MB/s.  Since those jobs are I/O
> bound I would have to assume that these huge swings in speed are causing my
> jobs to take longer.  Unfortunately I wasn't collecting the SAR/dd info in
> December so I don't have anything to compare it to.

-are you asking for XL or bigger VMs to get the full physical host and
less network throttling?

-does it behave differently if you bring up clusters on different sites?

> Just wondering if others have done these types of performance benchmarks and
> how they went about tuning Hadoop or tuning how you run your jobs to mitigate
> the effects.  If these were small variations in performance I wouldn't be
> too concerned.  But in any given test, I can have a drive running >20x
> faster/slower than another drive.
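
For anyone wanting to reproduce the per-drive comparison described above, here is a minimal sketch of a dd-style sequential write test plus a fastest/slowest ratio. It is not the script used in the original measurements: the function names, the 64 MB default size, and any mount points you pass in are assumptions for illustration. It uses buffered writes followed by fsync, so the numbers will not match dd with direct I/O exactly.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, total_mb=64, block_kb=1024):
    """Write about total_mb of zeros to a temp file in directory and return MB/s.

    The file is fsync'ed before the clock stops so that the page cache
    does not hide the cost of actually reaching the disk.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = max(1, (total_mb * 1024) // block_kb)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file
    written_mb = blocks * len(block) / (1024 * 1024)
    return written_mb / elapsed


def spread(rates):
    """Given {name: MB/s}, return (slowest, fastest, fastest/slowest ratio)."""
    slowest = min(rates, key=rates.get)
    fastest = max(rates, key=rates.get)
    return slowest, fastest, rates[fastest] / rates[slowest]


# Example use (mount points are hypothetical, substitute your own):
#   rates = {m: write_throughput_mb_s(m) for m in ["/mnt", "/mnt2"]}
#   print(spread(rates))
```

Running this on every node at job start and logging the ratio alongside the job's wall-clock time would give the baseline that was missing for the December runs.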
