hbase-user mailing list archives

From Andrew Purtell <apurt...@yahoo.com>
Subject Re: Region Splitting for moderate amount of daily data - Improve MapReduce Performance
Date Fri, 15 Apr 2011 22:48:28 GMT
> From: Joe Pallas <pallas@cs.stanford.edu>

> > Could it be that your row key is not distributing the
> > data well enough?
> > That is, if your key is primarily based on the current
> > date, it will only put the data into a small number of
> > regions.
> This, I have come to realize, is an essential difference
> between the Cassandra approach and the HBase approach. 
> With HBase, your keys can be randomly distributed over the
> entire keyspace, but if all your data fits in a single
> region, then all your requests are going to a single
> regionserver.  

Yes, BigTable == distributed ordered table; Cassandra == hash-partitioned ring, typically.
(With great simplification.) Because HBase is a distributed ordered table, it can provide
strongly consistent and atomic operations on rows, since each row exists in only one place
at a time. This is a feature, or a problem, or both, depending on your use case.
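
To make the contrast concrete, here is a minimal sketch in plain Python (not HBase code).
The split points, the key format, and the salted() helper are all hypothetical and chosen
only for illustration. It models region lookup in an ordered table as a binary search over
sorted split keys, then compares date-prefixed keys with the same keys behind a hash prefix.

```python
import bisect
import hashlib

# Hypothetical split points: four regions covering the whole key space.
SPLIT_KEYS = ["4", "8", "c"]

def region_for(key):
    """Index of the region whose key range contains `key` (ordered lookup)."""
    return bisect.bisect_right(SPLIT_KEYS, key)

def salted(key):
    """Prefix the key with two hex characters of its MD5 digest (illustrative only)."""
    return hashlib.md5(key.encode()).hexdigest()[:2] + "-" + key

# Keys that begin with the current date sort next to each other.
date_keys = ["20110415-%04d" % i for i in range(1000)]

date_regions = {region_for(k) for k in date_keys}
salted_regions = {region_for(salted(k)) for k in date_keys}

print("regions hit by date-prefixed keys:", sorted(date_regions))
print("regions hit by hash-prefixed keys:", sorted(salted_regions))
```

In this model every date-prefixed key lands in the same region, while the hash-prefixed
keys spread over all four. The price of the prefix is that rows for one day are no longer
contiguous, so an ordered scan over a date range has to visit every region.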

> The only ways I know around this are to make the split
> threshold low or to pre-split the table.  If you make
> the split threshold low, you get distribution for smaller
> tables, but if the tables get big, you have the overhead of
> more regions to deal with.

The split threshold is adjustable. It can be set as a table attribute on a per-table basis.
Start small, then revise it upward once enough regions have split that the table itself is
well distributed. This assumes the keys used while inserting were consistent with the key
distribution the application is expected to produce.
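
As a rough illustration of why a low threshold helps small tables and hurts big ones, here
is a back-of-the-envelope model in Python. It is a simplification, not how HBase decides
splits internally: it assumes uniformly distributed keys and that each split divides a
region exactly in half. The function name and the sizes are made up for the example.

```python
def regions_after_splits(table_mb, threshold_mb):
    """Idealized region count: keep halving until no region exceeds the threshold.

    Assumes uniform keys and perfect halving, so every region has the same size.
    """
    regions = 1
    while table_mb / regions > threshold_mb:
        regions *= 2
    return regions

# A 1 GB table with a low threshold is spread over many regions ...
small_table_low = regions_after_splits(1000, 64)
# ... but with a large threshold it stays in only a few.
small_table_high = regions_after_splits(1000, 256)
# The same low threshold on a 1 TB table produces a very large region count.
big_table_low = regions_after_splits(1000000, 64)

print(small_table_low, small_table_high, big_table_low)
```

Under these assumptions the low threshold gives the small table enough regions to spread
across servers, and raising it later keeps the region count of a growing table in check.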

With HBase 0.90, changing the schema requires disabling the table, making the schema change,
then enabling the table again.

With HBase 0.92, attribute changes such as changing the split threshold won't require a
disable/enable.

> If you pre-split the table,
> you're in good shape provided you know the key distribution
> in advance (although I am concerned about possible bugs
> involving empty regions, based on one recent experience).
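
If the key distribution is known from a sample of real keys, split keys for pre-splitting
can be derived from that sample. The following Python sketch is one way to do it, not an
HBase utility: it takes evenly spaced quantiles of the sorted sample. The function name and
the sample keys are hypothetical; the resulting list would then be handed to whatever
table-creation call accepts explicit split keys.

```python
def split_keys_from_sample(sample_keys, num_regions):
    """Pick num_regions - 1 split keys at evenly spaced quantiles of the sample."""
    ordered = sorted(set(sample_keys))
    if num_regions < 1 or num_regions > len(ordered):
        raise ValueError("need 1 <= num_regions <= number of distinct sample keys")
    step = len(ordered) / num_regions
    return [ordered[int(i * step)] for i in range(1, num_regions)]

# Hypothetical sample: 100 distinct keys of the form k000 .. k099.
sample = ["k%03d" % i for i in range(100)]
splits = split_keys_from_sample(sample, 4)
print(splits)
```

The split keys are only as good as the sample. If the live keys drift away from the sampled
distribution, some regions will stay empty or underutilized while others take most of the
load.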

Empty or underutilized regions can be merged, but only offline: disable the table, use the
Merge utility, then enable the table. Online merge is on the roadmap. It might be in 0.92;
if not, then in the release after that.

   - Andy
