hbase-user mailing list archives

From "Jean-Daniel Cryans" <jdcry...@gmail.com>
Subject Re: Writes - Poor performance on EC2
Date Thu, 21 Aug 2008 18:50:29 GMT
You have at least 1 region in a freshly created table. You can see this in
the web UI.

Performance is very poor when inserting data into a fresh table since there is
only one region, so a single region server takes every write until that region
splits. Try doing incremental batches of updates (starting with, let's say,
100 rows) while watching your number of regions.
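
The incremental approach above can be sketched roughly as follows. This is a minimal illustration, not HBase client code: `write_batch` and `count_regions` are hypothetical callbacks standing in for whatever performs the inserts and reads the region count (for example, from the web UI), and the doubling factor and size cap are assumptions. Only the starting size of about 100 rows comes from the suggestion above.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def incremental_batches(
    rows: Sequence,
    start: int = 100,       # first batch size, as suggested above
    factor: int = 2,        # assumed growth factor between batches
    max_size: int = 10000,  # assumed upper bound on batch size
) -> Iterator[Sequence]:
    """Yield consecutive slices of `rows`, growing the slice size each time."""
    size = start
    pos = 0
    while pos < len(rows):
        yield rows[pos:pos + size]
        pos += size
        size = min(size * factor, max_size)


def load_incrementally(
    rows: Sequence,
    write_batch: Callable[[Sequence], None],  # hypothetical: inserts one batch
    count_regions: Callable[[], int],         # hypothetical: current region count
) -> List[Tuple[int, int]]:
    """Write batches one at a time, recording (batch size, regions afterwards)."""
    log = []
    for batch in incremental_batches(rows):
        write_batch(batch)
        log.append((len(batch), count_regions()))
    return log
```

Checking the recorded region count after each batch shows when the table has split enough for larger batches to be spread over more than one region server.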

Small instances have only 1 CPU, which also means poor performance.
As I said in another thread, Hadoop and HBase are heavily multithreaded.

J-D

On Thu, Aug 21, 2008 at 2:44 PM, Manish Katyal <manish.katyal@gmail.com> wrote:

> Please see inline:
>
> On Thu, Aug 21, 2008 at 1:32 PM, Jean-Daniel Cryans <jdcryans@gmail.com> wrote:
>
> > Manish,
> >
> > Some questions:
> >
> > - Which version of Hadoop/HBase?
>
> 0.17.1 and 0.2.0
>
> >
> >
> > - Which type of EC2 instance?
>
> small
>
> >
> >
> > - How many regions does your table have at the beginning of the
> > experiment?
>
> I'm not clear about this question - there is no data in the table and thus I
> don't know how many regions the table has???
>
>
>
> >
> > Thx,
> >
> > J-D
> >
> > On Thu, Aug 21, 2008 at 2:23 PM, Manish Katyal <manish.katyal@gmail.com> wrote:
> >
> > > By looking at the iostat numbers, it appears the problem is that my data is
> > > being inserted in the reduce step - as a result only 2 of the region servers
> > > (# equal to tasktrackers) are being used at any given time (in fact, are
> > > getting slammed while the others are idle).
> > > I guess the solution is:
> > > - either randomly sort the data so the writes will be performed against
> > > different region servers (load balancing). The downside, the writes will
> > > take longer.
> > > - Or, increase the number of task trackers to be equal to the number of
> > > region servers (and hopefully because of the way the input files are split),
> > > effectively use all region servers concurrently.
> > >
> > > Any ideas?
> > >
> > > - Manish
> > >
> > > On Thu, Aug 21, 2008 at 10:56 AM, Manish Katyal <manish.katyal@gmail.com> wrote:
> > >
> > > > I'm running an experiment on EC2 (10 node cluster) that involves inserting
> > > > 12 million records (about 1.6GB) of data into HBase. The data is in HDFS and
> > > > I'm running M/R jobs to write to HBase.
> > > > The performance has been very poor - my M/R jobs have been timing out even
> > > > though the timeout has been set to 1800 seconds. Were it not for the
> > > > timeouts, I estimate it would have taken 10 or 12 hours to insert the data.
> > > >
> > > > Is this expected performance? Am I doing something wrong here?
> > > >
> > > > Configuration of the 10 small nodes on EC2:
> > > > - 5 Region servers - each running a data node
> > > > - 1 dedicated HBase Master Server
> > > > - 1 JobTracker server + datanode
> > > > - 1 server for Namenode and Secondary namenode
> > > > - 2 servers running the Task Trackers and Datanodes
> > > >
> > > > Any help or directions would be appreciated.
> > > >
> > > > Thanks,
> > > > - Manish Katyal
> > > >
> > >
> >
>
