hbase-user mailing list archives

From Joe Pallas <pal...@cs.stanford.edu>
Subject Re: Region Splitting for moderate amount of daily data - Improve MapReduce Performance
Date Fri, 15 Apr 2011 19:48:48 GMT

On Apr 14, 2011, at 12:18 PM, David Schnepper wrote:

> Could it be that your row key is not distributing the data well enough?
> That is, if your key is primarily based on the current date, it will only put the
> data into a small number of regions.

This, I have come to realize, is an essential difference between the Cassandra approach and
the HBase approach.  With HBase, your keys can be randomly distributed over the entire keyspace,
but if all your data fits in a single region, then all your requests are going to a single
regionserver.  
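
To make that concrete, here is a minimal Python sketch of how a row key maps to a region. It is a toy model, not HBase client code: `find_region` and the split-key lists are hypothetical names of mine, and the only thing it assumes is that regions are contiguous, sorted key ranges whose start keys are inclusive.

```python
import bisect
import hashlib


def find_region(split_keys, row_key):
    """Return the index of the region that holds row_key.

    split_keys is the sorted list of region boundaries. A table with no
    split keys has exactly one region (index 0). A key equal to a
    boundary belongs to the region that starts at that boundary.
    """
    return bisect.bisect_right(split_keys, row_key)


# 1000 row keys spread uniformly over the keyspace by hashing (illustrative data).
keys = [hashlib.sha256(str(i).encode()).hexdigest().encode() for i in range(1000)]

# Unsplit table: every key lands in region 0, however random the keys are.
unsplit_regions = {find_region([], k) for k in keys}

# Same keys, table split at "4", "8" and "c": the keys spread over four regions.
split_regions = {find_region([b"4", b"8", b"c"], k) for k in keys}
```

Randomizing the keys only helps once there are region boundaries for them to fall between.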

The only ways I know around this are to make the split threshold low or to pre-split the table.
If you make the split threshold low, you get distribution for smaller tables, but if the
tables get big, you have the overhead of more regions to deal with.  If you pre-split the
table, you're in good shape provided you know the key distribution in advance (although I
am concerned about possible bugs involving empty regions, based on one recent experience).
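
For the pre-split case, choosing the split keys might look something like the sketch below. It assumes the row keys start with a fixed number of lowercase hex digits that are uniformly distributed (a hash prefix, say); `hex_split_keys` and its parameters are names made up for illustration. With date-prefixed keys like the ones discussed earlier in the thread, that assumption fails and evenly spaced split keys would not help.

```python
def hex_split_keys(num_regions, width=8):
    """Return num_regions - 1 evenly spaced split keys for hex-string row keys.

    Assumes row keys begin with `width` lowercase hex digits that are
    uniformly distributed over the keyspace.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    keyspace = 16 ** width
    return [
        # Boundary i sits at i/num_regions of the way through the keyspace.
        format(i * keyspace // num_regions, "0{}x".format(width)).encode()
        for i in range(1, num_regions)
    ]


four_way = hex_split_keys(4, width=2)
```

The resulting list is the kind of thing you would hand to the table-creation call that accepts split keys (the Java admin API has a `createTable` variant taking a `byte[][]`).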

It seems that, until you have enough data relative to your cluster size, you must choose between
locality and distribution.  (When you have enough data, you get a better balance between the
two.)
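
A back-of-the-envelope way to see the tradeoff, under the crude assumption that a table ends up with roughly (table size / split threshold) regions and that each region is served by one regionserver at a time. The function names and the sizes used are illustrative, not measurements, and real region counts will differ because splits are not perfectly even.

```python
import math

MB = 1024 ** 2


def estimated_regions(table_bytes, split_threshold_bytes):
    """Rough region count: about one region per threshold's worth of data."""
    return max(1, math.ceil(table_bytes / split_threshold_bytes))


def servers_in_use(table_bytes, split_threshold_bytes, num_servers):
    """At most one server per region, and no more servers than the cluster has."""
    return min(num_servers, estimated_regions(table_bytes, split_threshold_bytes))


# A 100 MB table on a 10-node cluster:
high_threshold = servers_in_use(100 * MB, 256 * MB, 10)  # one region, one server
low_threshold = servers_in_use(100 * MB, 16 * MB, 10)    # several regions

# The same low threshold once the table reaches 1 TB:
regions_at_1tb = estimated_regions(1024 * 1024 * MB, 16 * MB)
```

The low threshold spreads the small table across most of the cluster, but leaves tens of thousands of regions to manage once the table is large.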

The HBase rebalancer, as I understand it, adjusts region assignments but doesn't adjust split
points (and hence doesn't change the number of regions).  Maybe that would be a useful feature
for some cases.

joe

