hbase-user mailing list archives

From Sam Seigal <selek...@yahoo.com>
Subject Re: hbase hashing algorithm and schema design
Date Wed, 08 Jun 2011 20:54:24 GMT
On Wed, Jun 8, 2011 at 12:40 AM, tsuna <tsunanet@gmail.com> wrote:

> On Tue, Jun 7, 2011 at 7:56 PM, Kjew Jned <selekt86@yahoo.com> wrote:
> > I was studying the OpenTSDB example, where they also prefix the row keys
> with
> > event id.
> >
> > I further modified my row keys to have this ->
> >
> > <eventid> <uuid>  <yyyy-mm-dd>
> >
> > The uuid is fairly unique and random.
> > Does appending a uuid to the event id help the distribution?
>
> Yes it will help the distribution, but it will also make certain query
> patterns harder.  You can no longer scan for a time range, for a given
> eventid.  How to solve this problem depends on how you generate the
> UUIDs.
>
> I wouldn't recommend doing this unless you've already tried simpler
> approaches and reached the conclusion that they don't work.  Many
> people seem to be afraid of creating hot spots in their tables without
> having first-hand evidence that the hot spots would actually be a
> problem.
>
Can I not use regex row filters to query for date ranges? There is added
overhead for the client to re-order the results, and it is not an efficient
query, but it is certainly possible to do. Am I wrong?
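Something like the following is what I had in mind (a rough, untested
sketch against the Java client API; the "events" table name and the key
layout are just placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;

HTable table = new HTable(HBaseConfiguration.create(), "events");
Scan scan = new Scan();
// Keep only rows for event id "A" with a date in June 2011.  The filter
// runs server-side, but every row of the table is still examined, so
// this is a full-table scan rather than a bounded range scan.
scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
    new RegexStringComparator("^A .* 2011-06-.*")));
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  // Rows come back ordered by the full key (eventid, uuid, date),
  // not by date, so the client has to re-sort them.
}
scanner.close();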



> > Let us say if I have 4 region servers to start off with and I start the
>
> If you have only 4 region servers, your goal should be to have roughly
> 25% of writes going to each server.  It doesn't matter if the 25%
> slice of one server is going to a single region or not.  As long as
> all the writes don't go to the same row (which would cause lock
> contention on that row), you'll get the same kind of performance.
>

I am worried about the following scenario, which is why I am putting a lot
of thought into how to design this schema.

For example, for simplicity's sake, say I only have two event ids, A and B,
and the traffic is equally distributed between them, i.e. 50% of my traffic
is event A and 50% is event B. I have two region servers running on two
physical nodes, with the following schema -

<eventid> <timestamp>

Ideally, I now have all of A's traffic going to region server A and all of
B's traffic going to region server B. The cluster is able to hold this
traffic, and the write load is distributed 50-50.

However, I now reach a point where I need to scale, since the two region
servers are no longer able to cope with the write traffic. Adding extra
region servers to the cluster is not going to make any difference, since
only the machines holding the tail region of each event id's key range
will receive the traffic. Most of the rest of my cluster is going to be
idle.

To generalize: if I want to scale to where the # of machines is greater
than the # of unique event ids, I have no way to distribute the load in an
efficient manner, since I cannot distribute the load of a single event id
across multiple machines (without adding a uuid somewhere in the middle and
sacrificing data locality on ordered timestamps).
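The only workaround I can think of is salting, i.e. prefixing the key with
a small bucket number derived from the uuid. A rough sketch (the bucket
count and the <bucket><eventid><timestamp> layout are invented for
illustration):

import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical salted layout: <bucket><eventid><timestamp>.  Writes for
// a single event id now spread across NUM_BUCKETS regions, at the price
// of NUM_BUCKETS scans (merged client-side) per time-range query.
static final int NUM_BUCKETS = 8;  // assumption: >= number of servers

static byte[] makeKey(String eventId, String uuid, long timestamp) {
  int bucket = (uuid.hashCode() & 0x7fffffff) % NUM_BUCKETS;
  return Bytes.add(new byte[] { (byte) bucket },
                   Bytes.toBytes(eventId),
                   Bytes.toBytes(timestamp));
}

Within a bucket, rows for an event id stay time-ordered, so each of the
NUM_BUCKETS scans is still a cheap bounded range scan.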

Do you think my concern is genuine? Thanks a lot for your help.


>
> > workload, how does HBase decide how many regions it is going to create,
> > and what key is going to go into what region?
>
> Your table starts as a single region.  As this region fills up, it'll
> split.  Where it splits is chosen by HBase.  HBase tries to split the
> region "in the middle", so that roughly the same number of keys ends up
> in each new daughter region.
>
> You can also manually pre-split a table (from the shell).  This can be
> advantageous in certain situations where you know what your table will
> look like and you have a very high write volume coupled with
> aggressive latency requirements for >95th percentile.
>
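Good to know. For what it's worth, pre-splitting can also be done through
the client API; a minimal sketch (the table name, column family, and split
points are invented for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor desc = new HTableDescriptor("events");
desc.addFamily(new HColumnDescriptor("d"));
// Create the table with 4 regions up front (split points B, C, D) so the
// initial write load fans out instead of hammering a single region.
admin.createTable(desc, new byte[][] {
    Bytes.toBytes("B"), Bytes.toBytes("C"), Bytes.toBytes("D") });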

> > I could have gone with something like
> >
> > <uuid><eventid><yyyy-mm-dd>, but would not like to, since my queries
> > are always going to be against a particular event id type, and I would
> > like them to be spatially located.
>
> If you have a lot of data per <eventid>, then putting the <uuid> in
> between the <eventid> and the <yyyy-mm-dd> will screw up data locality
> anyway.  But the exact details depend on how you pick the <uuid>.
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
>
