hadoop-general mailing list archives

From Stack <st...@duboce.net>
Subject Re: Performance of EC2
Date Tue, 26 Jan 2010 18:04:22 GMT
On Tue, Jan 26, 2010 at 8:47 AM, Something Something
<mailinglists19@gmail.com> wrote:
> I have noticed some strange performance numbers on EC2.  If someone can give
> me some hints to improve performance that would be greatly appreciated.
>  Here are the details:
> I have a process that runs a series of Jobs under Hadoop 0.20.1 & HBase
> 0.20.2.  I ran the *exact* same process with the following configurations:
> 1) 1 Master & 4 Workers (*c1.xlarge* instances) & 1 Zookeeper (*c1.medium*)
> with *8 Reducers* for every Reduce task.  The process completed in *849*
> seconds.

How many concurrent reducers run on each node?  Default two?

> 2) 1 Master, 4 Workers & 1 Zookeeper, *ALL m1.small* instances, with *8
> Reducers* for every Reduce task.  The process completed in *906* seconds.
> 3) 1 Master, *11* Workers & *3* Zookeepers, *ALL m1.small* instances, with
> *20 Reducers* for every Reduce task.  The process completed in *984* seconds!

How much of this overall time is spent in reduce phase, in particular
the time spent inserting into hbase? (Starts at 66% IIRC)

> Two main questions:
> 1)  It's totally surprising that when I have 11 workers with 20 Reducers it
> runs slower than with exactly the same type of fewer machines and fewer
> reducers.

Yes.  My guess is that on the small instances, if you ran the job
multiple times, you'd see large variance in how long it takes to
complete.

> 2)  As expected it runs faster on c1.xlarge, but the performance improvement
> doesn't justify the high cost difference.  I must not be utilizing the
> machine power, but I don't know how to do that.

The main reason for xlarge is that its performance profile is more
predictable than that of small-sized instances.  I'm a little
surprised it all worked on the small instances at all.

I'd suggest you spend a bit of time figuring out where your MR jobs
are spending their time.  Is it all in HBase inserts?  Are the inserts
going to a new table?

> Here are some of the performance improvements tricks that I have learnt from
> this mailing list in the past that I am using:
> 1)  conf.set("hbase.client.scanner.caching", "30");   I have this for all
> jobs

FYI, you can set this on the Scan instance rather than globally in the
conf.
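Roughly like this, as a sketch against the HBase 0.20.x client API (the
table name is the one from your mail; the mapper and job setup are elided):

```java
import org.apache.hadoop.hbase.client.Scan;

// Set caching on the Scan itself so it only affects this job's scan,
// instead of setting hbase.client.scanner.caching globally in the conf.
Scan scan = new Scan();
scan.setCaching(30);  // rows fetched per next() RPC for this scan only
// ...then pass this Scan to TableMapReduceUtil.initTableMapperJob(...)
```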

> 2)  Using the following code every time I open a HTable:
>        this.table = new HTable(new HBaseConfiguration(), "tablenameXYZ");
>        table.setAutoFlush(false);
>        table.setWriteBufferSize(1024 * 1024 * 12);

Are you opening a new table inside each task or once up in the config?
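If it's the former, a sketch of doing it once per task instead (Hadoop 0.20
new API; MyReducer and the key/value types are just placeholders, and the
table settings are the ones from your mail):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // Open the table once per task, not once per reduce() call.
    table = new HTable(new HBaseConfiguration(), "tablenameXYZ");
    table.setAutoFlush(false);
    table.setWriteBufferSize(1024 * 1024 * 12);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();  // push any buffered Puts before the task exits
    table.close();
  }
}
```

With autoflush off, the explicit flushCommits() in cleanup() matters;
otherwise Puts still sitting in the write buffer are lost when the task ends.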

> 4)  Change the No. of Reducers as per the No. of Workers.  I believe the
> formula is:  # of workers * 1.75.

You have to temper the above general rule with the fact that
tasktrackers and datanodes running on the same node can impinge upon
each other, often to the regionserver's detriment.
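For what it's worth, the usual form of that rule of thumb counts reduce
slots, not just nodes: 1.75 * (nodes * reduce slots per node, default 2).
A trivial sketch of the arithmetic (class and method names are made up):

```java
public class ReducerCount {
    // Rule of thumb: reducers = 1.75 * (nodes * reduce slots per node).
    static int recommendedReducers(int nodes, int reduceSlotsPerNode) {
        return (int) Math.floor(1.75 * nodes * reduceSlotsPerNode);
    }

    public static void main(String[] args) {
        // With the default of 2 reduce slots per node:
        System.out.println(recommendedReducers(4, 2));   // 4 workers  -> 14
        System.out.println(recommendedReducers(11, 2));  // 11 workers -> 38
    }
}
```

By that count your 8 reducers on 4 workers is on the low side and 20 on
11 workers even more so, though as above, co-located regionservers can
make more reducers hurt rather than help.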

That's enough for now.  I'm sure others on the list have opinions on the above.

