hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: RowCounter example run time
Date Sun, 23 May 2010 23:32:17 GMT


Ryan & JD

I'm aware of the difficulties in trying to maintain an accurate row count.
It's not trivial, but it's not rocket science either.

There are a couple of ways of doing this and it will take some time to think through the benefits
vs the costs of how you do it.

You're right. It's more difficult than a C-ISAM single-machine database.
There are tricks one could use.

But I think this should be taken offline; maybe open a JIRA issue, if one doesn't
already exist?

-Mike


> Date: Sun, 23 May 2010 11:20:17 -0700
> Subject: Re: RowCounter example run time
> From: ryanobjc@gmail.com
> To: user@hbase.apache.org
> 
> The select count(*) optimization is a classic in databases - some
> people argue that it's really important and should be optimized for
> (MyISAM, for example) while others note that it's a trick and real DB
> loads rarely use it on a sizable table.  Note that MyISAM locks the
> entire table for each update (only 1 update at a time), so comparing
> HBase to it is odd.  InnoDB doesn't maintain a count (keeping global stats
> accurate under concurrent load is difficult).  Oracle doesn't either (but
> may be able to use a primary index to reduce the blocks read).
> 
> Implementing this in HBase might be difficult - when a new column is
> inserted into a table, the regionserver doesn't know if that row
> already exists - to know that, it would have to read some data,
> potentially from disk, first.  Any scheme that requires the
> regionserver to increment a "rowsForRegion" counter during certain inserts
> would therefore be problematic.
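A tiny sketch (plain Python with invented class names, not the HBase API) of why blindly incrementing a per-region counter on insert goes wrong:

```python
# Hypothetical sketch: a regionserver sees puts, not "insert vs. update",
# so a counter that increments on every put double-counts overwrites of
# existing rows; doing better requires checking whether the row already
# exists, which may mean reading from disk.
class NaiveRegionCounter:
    def __init__(self):
        self.count = 0

    def on_put(self, row):
        self.count += 1  # wrong: increments even if `row` already exists


class CheckingRegionCounter:
    def __init__(self):
        self.rows = set()  # stands in for an existence check (a disk read)

    def on_put(self, row):
        self.rows.add(row)

    @property
    def count(self):
        return len(self.rows)


naive, checking = NaiveRegionCounter(), CheckingRegionCounter()
for r in ["row-a", "row-b", "row-a"]:  # "row-a" is written twice
    naive.on_put(r)
    checking.on_put(r)
print(naive.count, checking.count)  # 3 2 -- the naive scheme overcounts
```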
> 
> As JD noted, the likely cause here is scanner pre-fetch caching.  We
> ship with very conservative scanner pre-fetch values because if a
> client takes too long they will get a fatal exception.  RowCounter MR
> jobs shouldn't be like that however.
> 
> As for cluster sizing - 6-10 nodes is really the minimum.  With 3 nodes you
> are replicating data to every node, and you aren't getting the benefits
> of a clustered solution.  At higher node counts you get some disjoint
> parallelism underway and things really pick up on the larger datasets
> (I can do MapReduces at 7-8M rows/sec for 20+ minutes on end).
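The 3-node replication point can be made concrete with quick arithmetic (a rough model that ignores balancing details): with HDFS's default replication factor of 3, each of 3 datanodes ends up holding a full copy of the data.

```python
# Rough model: with N datanodes and replication factor R, each node
# stores about R/N of the dataset (capped at 100%).
def data_fraction_per_node(nodes, replication=3):
    return min(1.0, replication / nodes)

print(data_fraction_per_node(3))   # 1.0 -- every node holds everything
print(data_fraction_per_node(10))  # 0.3 -- real distribution kicks in
```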
> 
> -ryan
> 
> 
> On Sun, May 23, 2010 at 7:58 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> > On Sun, May 23, 2010 at 10:36 AM, Michael Segel
> > <michael_segel@hotmail.com>wrote:
> >
> >>
> >> J-D,
> >>
> >> Here's the problem: you go to any relational database, do a select
> >> count(*), and you get a response back fairly quickly.
> >> The difference is that in HBase you're doing a physical count, while the
> >> relational engine is pulling it from metadata.
> >>
> >> I have a couple of ideas on how we could do this...
> >>
> >> -Mike
> >>
> >> > Date: Sat, 22 May 2010 09:25:51 -0700
> >> > Subject: Re: RowCounter example run time
> >> > From: jdcryans@apache.org
> >> > To: user@hbase.apache.org
> >> >
> >> > My first question would be, what do you expect exactly? Would 5 min be
> >> > enough? Or are you expecting something more like 1-2 secs (which is
> >> > impossible since this is mapreduce)?
> >> >
> >> > Then there's also Jon's questions.
> >> >
> >> > Finally, did you set a higher scanner caching on that job?
> >> > hbase.client.scanner.caching is the name of the config, which defaults
> >> > to 1. When mapping an HBase table, if you don't set it higher you're
> >> > basically benchmarking the RPC layer, since it does 1 call per next()
> >> > invocation. Setting the right value depends on the size of your rows,
> >> > e.g. are you storing 60 bytes or something big like 100KB? On our 13B-
> >> > row table (each row is a few bytes), we set it to 10k.
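The cost J-D describes is easy to ballpark; a hypothetical helper (not part of HBase) for the number of next() round-trips a full scan needs at a given caching value:

```python
import math

# One RPC fetches `caching` rows, so scanning `rows` rows costs
# ceil(rows / caching) round-trips.
def scan_rpc_calls(rows, caching=1):
    return math.ceil(rows / caching)

print(scan_rpc_calls(22_500_000, 1))      # 22500000 RPCs at the default
print(scan_rpc_calls(22_500_000, 1_000))  # 22500 -- three orders fewer
```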
> >> >
> >> > J-D
> >> >
> >> > On Sat, May 22, 2010 at 8:40 AM, Andrew Nguyen
> >> > <andrew-lists-hbase@ucsfcti.org> wrote:
> >> > > Hello,
> >> > >
> >> > > I finally got some decent hardware to put together a 1 master, 4 slave
> >> Hadoop/HBase cluster.  However, I'm still waiting for space in the
> >> datacenter to clear out and only have 3 of the nodes deployed (master + 2
> >> slaves).  Each node is a quad-core AMD with 8G of RAM, running on a GigE
> >> network.  HDFS is configured to run on a separate (from the OS drive) U320
> >> drive.  The master has RAID1 mirrored drives only.
> >> > >
> >> > > I've installed HBase with slave1 and slave2 as regionservers and
> >> master, slave1, slave2 as the ZK quorom.  The master serves as the NN and JT
> >> and the slaves as DN and TT.
> >> > >
> >> > > Now my question:
> >> > >
> >> > > I've imported 22.5M rows into HBase, into a single table.  Each row has
> >> 8 or so columns.  I just ran the RowCounter MR example and it takes about 25
> >> minutes to complete.  Is a 3 node setup too underpowered to combat the
> >> overhead of Hadoop and HBase?  Or, could it be something with my
> >> configuration?  I've been playing around with Hadoop some but this is my
> >> first attempt at anything HBase.
> >> > >
> >> > > Thanks!
> >> > >
> >> > > --Andrew
> >>
> >>
> >
> > Every system has its tradeoff. In the example above:
> >
> >>> select count(*) and you get a response back fairly quickly.
> >
> > Try this with MyISAM: very fast. Try that with InnoDB: it takes a very
> > long time. Some systems maintain a row count and some do not.
> >
> > Now if you are using innodb there is a quick way to get an approximate row
> > count.
> >
> > explain select count(*)
> >
> > This causes the InnoDB engine to use index statistics for an approximate table size.
> >
> > HBase does not maintain a row count. Getting the row count is an intensive
> > process, as it scans every row. Such is life.
> >