hbase-user mailing list archives

From Something Something <mailinglist...@gmail.com>
Subject Re: Performance of EC2
Date Tue, 26 Jan 2010 19:20:03 GMT
Here are some of the answers:

>>  How many concurrent reducers run on each node?  Default two?
I was assuming 2 on each node would be the default.  If not, this could be a
problem.  Please let me know.
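
My understanding (please correct me) is that the per-node count is a
tasktracker setting, not something the job controls.  A minimal sketch, with
the property name and default as I believe they are in Hadoop 0.20 (worth
verifying against mapred-site.xml on the workers):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf();
    // Concurrent reducers per node come from the tasktracker config on
    // each worker; in Hadoop 0.20 this property defaults to 2.
    int slotsPerNode =
        conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
    // The number of reduce tasks for a single job is set separately:
    conf.setNumReduceTasks(8);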

>>I'd suggest you spend a bit of time figuring where your MR jobs
are spending their time?
I agree.  Will do some more research :)

>>How much of this overall time is spent in reduce phase?
Most of the time is spent in the Reduce phases, because that's where most of
the critical code is.

>>Are inserts to a new table?
Yes, all inserts will always be to a new table.  In fact, I disable/drop
HTables during this process.  I'm not using any special indexes; should I be?
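
For reference, the drop/recreate step looks roughly like this (a sketch
against the HBase 0.20 admin API; the table and column-family names are
placeholders):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    void recreateTable() throws IOException {
      HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
      if (admin.tableExists("tablenameXYZ")) {
        admin.disableTable("tablenameXYZ");  // a table must be disabled
        admin.deleteTable("tablenameXYZ");   // before it can be dropped
      }
      HTableDescriptor desc = new HTableDescriptor("tablenameXYZ");
      desc.addFamily(new HColumnDescriptor("cf"));  // placeholder family
      admin.createTable(desc);
    }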

>>I'm a little surprised that all worked on the small instances, that your
jobs completed.
But really, shouldn't Amazon guarantee predictability? :)  After all, I am
paying for these instances... albeit a small amount!

>>Are you opening a new table inside each task or once up in the config?
I open an HTable in the 'setup' method of each mapper/reducer and close it
in the 'cleanup' method.
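
Concretely, the pattern is (a sketch against the Hadoop 0.20 mapreduce API;
the reducer's key/value types and the table name are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyReducer extends Reducer<Text, Text, Text, Text> {
      private HTable table;

      @Override
      protected void setup(Context context) throws IOException {
        // One HTable per task, reused across all reduce() calls.
        table = new HTable(new HBaseConfiguration(), "tablenameXYZ");
        table.setAutoFlush(false);
        table.setWriteBufferSize(1024 * 1024 * 12);
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        // Flush any Puts still sitting in the client-side write buffer
        // (autoflush is off), then release the table.
        table.flushCommits();
        table.close();
      }
    }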

>>You have to temper the above general rule with the fact that...
I will try a few combinations.

>>How big is your dataset?
This one in particular is not big, but the real production ones will be.
Here's approximately how many rows get processed:
Phase 1:  300 rows
Phases 2 through 8:  100 rows
(Note:  Each phase does complex calculations on each row.)

Thanks for the help.


On Tue, Jan 26, 2010 at 10:36 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> How big is your dataset?
>
> J-D
>
> On Tue, Jan 26, 2010 at 8:47 AM, Something Something
> <mailinglists19@gmail.com> wrote:
> > I have noticed some strange performance numbers on EC2.  If someone can
> > give me some hints to improve performance, that would be greatly
> > appreciated.  Here are the details:
> >
> > I have a process that runs a series of Jobs under Hadoop 0.20.1 & HBase
> > 0.20.2.  I ran the *exact* same process with the following configurations:
> >
> > 1) 1 Master & 4 Workers (*c1.xlarge* instances) & 1 Zookeeper
> > (*c1.medium*), with *8 Reducers* for every Reduce task.  The process
> > completed in *849* seconds.
> >
> > 2) 1 Master, 4 Workers & 1 Zookeeper, *ALL m1.small* instances, with *8
> > Reducers* for every Reduce task.  The process completed in *906* seconds.
> >
> > 3) 1 Master, *11* Workers & *3* Zookeepers, *ALL m1.small* instances,
> > with *20 Reducers* for every Reduce task.  The process completed in
> > *984* seconds!
> >
> >
> > Two main questions:
> >
> > 1)  It's totally surprising that when I have 11 workers with 20 Reducers,
> > it runs slower than when I have fewer machines of exactly the same type
> > with fewer reducers.
> > 2)  As expected, it runs faster on c1.xlarge, but the performance
> > improvement doesn't justify the large cost difference.  I must not be
> > utilizing the machines' full power, but I don't know how to change that.
> >
> > Here are some of the performance-improvement tricks that I have learned
> > from this mailing list in the past and am using:
> >
> > 1)  conf.set("hbase.client.scanner.caching", "30");   I have this for all
> > jobs.
> >
> > 2)  Using the following code every time I open an HTable:
> >        this.table = new HTable(new HBaseConfiguration(), "tablenameXYZ");
> >        table.setAutoFlush(false);
> >        table.setWriteBufferSize(1024 * 1024 * 12);
> >
> > 3)  For every Put I do this:
> >          Put put = new Put(Bytes.toBytes(out));
> >          put.setWriteToWAL(false);
> >
> > 4)  Change the number of Reducers according to the number of Workers.  I
> > believe the formula is:  # of workers * 1.75 (so 4 workers -> 7, which I
> > round up to 8; 11 workers -> 19.25, which I round up to 20).
> >
> > Any other hints?  As always, greatly appreciate the help.  Thanks.
> >
>
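
P.S.  Tips 1) and 3) from my quoted message, combined into one sketch (the
table, family, qualifier, and row names are placeholders):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    HBaseConfiguration conf = new HBaseConfiguration();
    // Tip 1: fetch 30 rows per scanner round-trip instead of the default 1.
    conf.set("hbase.client.scanner.caching", "30");

    HTable table = new HTable(conf, "tablenameXYZ");
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("val"));
    // Tip 3: skip the write-ahead log.  Faster, but these edits are lost
    // if a region server dies before they are flushed to disk.
    put.setWriteToWAL(false);
    table.put(put);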
