hbase-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: hbase hashing algorithm and schema design
Date Thu, 09 Jun 2011 01:02:16 GMT
Sam, would HBaseWD help you here?
See 
http://search-hadoop.com/m/AQ7CG2GkiO/hbasewd&subj=+ANN+HBaseWD+Distribute+Sequential+Writes+in+HBase
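
For readers who haven't used it, the core idea behind HBaseWD is to spread otherwise sequential row keys over a fixed number of buckets by prepending a short, deterministic prefix; reads then fan out into one scan per bucket and merge the results. A minimal sketch of that idea in plain Java follows; the class name and bucket count are made up, and this is not HBaseWD's actual API.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class BucketedKeySketch {
        private static final int NUM_BUCKETS = 16; // illustrative bucket count

        // Prepend a one-byte bucket id derived from the original key, so that
        // otherwise sequential keys land in NUM_BUCKETS different key ranges.
        static byte[] toBucketedKey(byte[] originalKey) {
            int bucket = (Arrays.hashCode(originalKey) & 0x7fffffff) % NUM_BUCKETS;
            byte[] bucketed = new byte[originalKey.length + 1];
            bucketed[0] = (byte) bucket;
            System.arraycopy(originalKey, 0, bucketed, 1, originalKey.length);
            return bucketed;
        }

        public static void main(String[] args) {
            byte[] key = "eventA2011-06-08T12:00:00".getBytes(StandardCharsets.UTF_8);
            System.out.println("bucket = " + toBucketedKey(key)[0]);
        }
    }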


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
> From: Sam Seigal <selekt86@yahoo.com>
> To: user@hbase.apache.org
> Cc: joey@cloudera.com; tsunanet@gmail.com
> Sent: Wed, June 8, 2011 4:54:24 PM
> Subject: Re: hbase hashing algorithm and schema design
> 
> On Wed, Jun 8, 2011 at 12:40 AM, tsuna <tsunanet@gmail.com> wrote:
> 
> > On Tue, Jun 7, 2011 at 7:56 PM, Kjew Jned <selekt86@yahoo.com> wrote:
> > > I was studying the OpenTSDB example, where they also prefix the row keys
> > > with event id.
> > >
> > > I further modified my row keys to have this ->
> > >
> > > <eventid> <uuid> <yyyy-mm-dd>
> > >
> > > The uuid is fairly unique and random.
> > > Does appending a uuid to the event id help the distribution?
> >
> > Yes it will help the distribution, but it will also make certain query
> > patterns harder.  You can no longer scan for a time range, for a given
> > eventid.  How to solve this problem depends on how you generate the
> > UUIDs.
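
For context, the range scan being referred to (all rows for one eventid between two dates) relies on lexicographic start/stop rows, which only works while the timestamp immediately follows the eventid in the key. A rough sketch against the Java client of that era; the table name, dates, and key layout are illustrative assumptions, not something from this thread.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventRangeScan {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "events"); // hypothetical table
            Scan scan = new Scan();
            // Row keys sort lexicographically, so [eventA2011-06-01, eventA2011-06-08)
            // covers one event id and one week; this stops working once a random uuid
            // sits between the event id and the date.
            scan.setStartRow(Bytes.toBytes("eventA" + "2011-06-01"));
            scan.setStopRow(Bytes.toBytes("eventA" + "2011-06-08"));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }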
> >
> > I wouldn't recommend doing this unless you've already tried simpler
> > approaches and reached the conclusion that they don't work.  Many
> > people seem to be afraid of creating hot spots in their tables without
> > having first-hand evidence that the hot spots would actually be a
> > problem.
> >
> >
> >
> >
> Can I not use regex row filters to query for date ranges? There is added
> overhead for the client to order the results, and it is not an efficient
> query, but it is certainly possible to do so, isn't it? Am I wrong?
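
The filter route Sam asks about does exist: a RowFilter with a RegexStringComparator will match such keys. The catch is that the filter is evaluated against every row the scan touches, so it does not narrow the key range the way start/stop rows do. A small sketch; the key pattern is an illustrative assumption.

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.RegexStringComparator;
    import org.apache.hadoop.hbase.filter.RowFilter;

    public class RegexDateScan {
        // Build a scan that keeps only rows whose key looks like
        // eventA<uuid>2011-06-dd; every row in range is still read and tested.
        public static Scan buildScan() {
            Scan scan = new Scan();
            scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                    new RegexStringComparator("^eventA.*2011-06-.*$")));
            return scan;
        }
    }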
> 
> 
> 
> > > Let us say I have 4 region servers to start off with and I start the
> >
> > If you have only 4 region servers, your goal should be to have roughly
> > 25% of writes going to each server.  It doesn't matter if the 25%
> > slice of one server is going to a single region or not.  As long as
> > all the writes don't go to the same row (which would cause lock
> > contention on that row), you'll get the same kind of performance.
> >
> 
> I am worried about the following scenario, hence putting a lot of thought
> into how to design this schema.
> 
> For example, for simplicity's sake, say I only have two event IDs, A and B,
> and the traffic is equally distributed between them, i.e. 50% of my traffic
> is event A and 50% is event B. I have two region servers running on two
> physical nodes, with the following schema -
> 
> <eventid> <timestamp>
> 
> Ideally, I now have all of A's traffic going into region server A and all of
> B's traffic going into region server B. The cluster is able to hold this
> traffic, and the write load is distributed 50-50.
> 
> However, now I reach a point where I need to scale, since the two nodes are
> no longer able to cope with the write traffic. Adding extra region servers
> to the cluster is not going to make any difference, since only the physical
> machine holding the tail end of the region is the one that will receive the
> traffic. Most of the rest of my cluster is going to be idle.
> 
> To generalize, if I want to scale to where the # of machines is greater than
> the # of unique event ids, I have no way to distribute the load in an
> efficient manner, since I cannot distribute the load of a single event id
> across multiple machines (without adding a uuid somewhere in the middle and
> sacrificing data locality on ordered timestamps).
> 
> Do you think my concern is genuine? Thanks a lot for your help.
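
One possible middle ground, sketched below for concreteness, is to place a small bounded bucket (derived from the uuid) between <eventid> and <timestamp> instead of the full uuid: each event id then spreads over a fixed number of regions, each bucket stays time-ordered, and a time-range query for one event id costs one range scan per bucket, merged on the client. The bucket count and helper below are made-up illustrations, not something proposed in the thread.

    import org.apache.hadoop.hbase.util.Bytes;

    public class BucketedEventKey {
        private static final int NUM_BUCKETS = 8; // illustrative; bounded, unlike a raw uuid

        // <eventid><bucket><yyyy-mm-dd>: the bucket spreads writes for one event id,
        // while dates remain contiguous within each bucket.
        public static byte[] rowKey(String eventId, String uuid, String isoDate) {
            byte bucket = (byte) ((uuid.hashCode() & 0x7fffffff) % NUM_BUCKETS);
            return Bytes.add(Bytes.toBytes(eventId),
                             new byte[] { bucket },
                             Bytes.toBytes(isoDate));
        }

        public static void main(String[] args) {
            byte[] key = rowKey("eventA", "4f2c9e7a-1b3d-4c5e-8f90-abcdef012345", "2011-06-08");
            System.out.println(Bytes.toStringBinary(key));
        }
    }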
> 
> 
> >
> > > workload, how does HBase decide how many regions it is going to create,
> > > and what key is going to go into what region?
> >
> > Your table starts as a single region.  As this region fills up, it'll
> > split.  Where it splits is chosen by HBase.  HBase tries to split the
> > region "in the middle", so that roughly the same number of keys ends up
> > in each new daughter region.
> >
> > You can also manually pre-split a table (from the shell).  This can be
> > advantageous in certain situations where you know what your table will
> > look like and you have a very high write volume coupled with aggressive
> > latency requirements for >95th percentile.
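
For reference, a pre-split can also be done from the Java API rather than the shell. A sketch; the table name, column family, and split points (chosen along <eventid> boundaries) are made up.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitEvents {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
            HTableDescriptor desc = new HTableDescriptor("events"); // hypothetical table
            desc.addFamily(new HColumnDescriptor("d"));
            // One initial region per event id prefix, so writes are spread from day one
            // instead of waiting for natural splits.
            byte[][] splits = new byte[][] {
                Bytes.toBytes("eventB"),
                Bytes.toBytes("eventC"),
                Bytes.toBytes("eventD"),
            };
            admin.createTable(desc, splits);
        }
    }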
> >
> 
> > > I could have gone with something like
> > >
> > > <uuid><eventid><yyyy-mm-dd>, but would not like to, since my queries are
> > > always going to be against a particular event id type, and I would like
> > > them to be spatially located.
> >
> > If you have a lot of data per <eventid>, then putting the <uuid> in
> > between the <eventid> and the <yyyy-mm-dd> will screw up data locality
> > anyway.  But the exact details depend on how you pick the <uuid>.
> >
> > --
> > Benoit "tsuna" Sigoure
> > Software Engineer @ www.StumbleUpon.com
> >
> 
