hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sam Seigal <selek...@yahoo.com>
Subject Re: pre splitting tables
Date Thu, 27 Oct 2011 02:34:22 GMT
On Tue, Oct 25, 2011 at 1:02 PM, Nicolas Spiegelberg
<nspiegelberg@fb.com> wrote:
>>According to my understanding, the way that HBase works is that on a
>>brand new system, all keys will start going to a single region i.e. a
>>single region server. Once that region
>>reaches a max region size, it will split and then move to another
>>region server, and so on and so forth.
> Basically, the default table create is 1 region per table that can only go
> to 1 RS.  Splits happen on that region when it gets large enough, but
> balancing the new region to another server is an asynchronous event and
> doesn't happen immediately after the first split because of
> "hbase.regions.slop".  The idea is to create the table with R regions
> across S servers so each server has R/S regions and puts will be roughly
> uniformly distributed across the R regions, keeping every server equally
> busy.  Sounds like you have a good handle on this behavior.
>>My strategy is to take the incoming data, calculate the hash and then
>>mod the hash with the number of machines I have. I will split the
>>regions according to the prefix # .
>>This should , I think provide for better data distribution when the
>>cluster first starts up with one region / region server.
> The problem with this strategy: so say you split into 256 regions.  Region
> splits are basically memcmp() ranges, so they would look like this:
> Key prefix      Region
> 0x00 - 0x01     1
> 0x01 - 0x02     2
> 0x02 - 0x03     3
> ...

Aren't the regions boundaries going to be something like this:

0x000000000000 ...
0xf0ffffffffffffffffffffffffff ...
0x1ffffffffffffffffffffffff ...

The idea is that once the region starts growing, I continue doing
manual splits, the data
will be split at boundaries at a finer grain than just the machine
prefix, and once good
distribution is achieved, I basically stop doing the splits.

When adding more machines, a rolling split will have to be performed
in any case, right ?
The application logic can also be modified for adding more machine prefixes.

> Let's say you prefix your key with the machine ID#.  You are probably
> using a UINT32 for the machine ID, but let's assume that your using a
> UINT8 for this discussion.  Your M machine IDs would map to exactly 1
> region each.  So only M out of 256 regions would contain any data.  Every
> time you add a new machine, all the puts will only go to one region.  By
> prefixing your key with a hash (MD5, SHA1, whatevs), you'll get random
> distribution on your prefix and will populate all 256 regions evenly.

For my use case, I am generating indexes for the reads, so prefixing a hash
should not be a problem.

But there is also a broader question. By prefixing the hash, I am
losing a lot in
data locality, as well as the ability to query on the leading position
of the key.
Doesn't this make a predictable set of machine id's a better candidate ?

> (FYI: If you were using a monotonically increasing UINT32, this would be
> worse because they'd probably be low numbers and all map to the 0x00
> region).
>>These regions should then grow fairly uniformly. Once they reach a
>>size like ~ 5G, I can do a rolling split.
> I think you want to do rolling splits very infrequently.  We have PB of
> data and only rolling split twice.  The more regions / server, the faster
> server recovery happens because there's more parallelism for distributed
> log splitting.  If you have too many regions / cluster, you can overwork
> the load balancer on the master and increase startup time.  We're running
> at 20 regions/server * 95 regionservers ~= 1900 regions on the master.
> How many servers do you have?  If you have M servers, I'd try to split
> into M*(M-1) regions but keep that value lower than 1000 if possible.
> It's also nice to split on log2 boundaries for readability, but
> technically the optimal scenario is to have 1 server die and have every
> other server split & add exactly the same amount of its regions. M*(M-1)
> would give every other server 1 of the regions.
>>Also, I want to make sure my regions do not grow too much in size that
>>when I end up adding more machines, it does not take a very long time
>>to perform a rolling split.
> Meh.  This happens.  This is also what rolling split is designed/optimized
> for.  It persists the split plan to HDFS so you can stop it and restart
> later.  It also round robin's the splits across servers & prioritizes
> splitting low-loaded servers to minimize region-based load balancing
> during the process.
>>What I do not understand is the advantages/disasvantages of having
>>regions that are too big vs regions that are too thin. What does this
>>impact ? Compaction time ? Split time ? What is the
>>concern when it comes to how the architecture works. I think if I
>>understand this better, I can manage my regions more efficiently.
> Are you IO-bound or CPU-bound?  The default HBase setup is geared towards
> the CPU-bound use case and you should have acceptable performance out of
> the box after pre-splitting and a couple well-documented configs (ulimit
> comes to mind).  In our case, we have medium sized data and are fronted by
> an application cache, so we're IO-bound.  Because of this, we need to
> tweak the compaction algorithm to minimize IO writes at the expense of
> StoreFiles and use optional features like BloomFilters & the TimeRange
> Filter to minimize the number of StoreFiles we need to query on a get.

The main purpose of the cluster is generating aggregations that can be
tied back to the
transaction level details.

So lets say I have a few transaction come in the form:

eventid_eventKey1 : <attr1: 5, attr2 : 10 ..>
eventid_eventKey1 : <attr1: 2, attr2 : 9 ..>
eventid_eventKey1 : <attr1: 1, attr2 : 2 ..>

Our clients are interested in aggregations at various dimensions of
the incoming data.
So, a map reduce job will run and produce an index on these attributes:


another job will then produce the aggregation:

attr1: 8

(There are plans to capture these job dependencies in some sort of a
DAG job workflow framework as requirements get complex)

Our customers want to be able to tie back the aggregations to the
transaction level details. This is why HBase is
a great choice. We can query an "aggregations" table to get the
aggregations and then use the same key to query a
"transaction detail table" to figure out the attributes that make up
the aggregation.

So long story short, the read volume from concurrent users is going to
be very low. We expect no more than 20 people to be using the system
at a given time.
The write volume on the other hand is going to be high, and there will
be stress on the cluster mostly due to map reduce jobs creating
indexes and
performing aggregations on those indexes.

What would you suggest would be the config parameters to look at for
this sort of a scenario ?

To start off with a shadow HBase cluster will only take up around some
% of the production traffic which is around 20,0000 messages / hr. If
we keep on adding more machines and start
transferring more and more traffic, this can reach around 500,000
messages / hr.

Right now, the hardware I have available to do my testing is 5
machines - 8 GB memory, 2.26 GHz, quad core, 120 GB disk space

Thanks a lot for your help !


View raw message