hbase-user mailing list archives

From Sam Seigal <selek...@yahoo.com>
Subject Re: pre splitting tables
Date Tue, 25 Oct 2011 08:53:12 GMT
On Mon, Oct 24, 2011 at 9:22 PM, Karthik Ranganathan
<kranganathan@fb.com> wrote:
>
>
> << ...mod the hash with the number of machines I have... >>
> This means that the data will change with the number of machines - so all
> your data will map to different regions if you add a new machine to your
> cluster.
>
>
> << What I do not understand is the advantages/disadvantages of having
> regions that are too big vs regions that are too thin. >>
> The disadvantage is that some regions (and consequently nodes) will have a
> lot of data which will adversely affect things like storage (if dfs is
> local to that node), block cache hit ratio, etc.

Can you please explain a bit more about how a bigger region size
affects the block cache hit ratio?

>
> In general - per our experience using HBase, it's much more desirable to
> disperse data up-front. If you are building indexes using MR, then you
> probably don't need range scan ability on your keys.
>
> Thanks
> Karthik
>
>
>
> On 10/24/11 4:48 PM, "Sam Seigal" <selekt86@yahoo.com> wrote:
>
>>According to my understanding, the way that HBase works is that on a
>>brand new system, all keys will start going to a single region i.e. a
>>single region server. Once that region
>>reaches a max region size, it will split and then move to another
>>region server, and so on and so forth.
>>
>>Initially hooking up HBase to a prod system, I am concerned about this
>>behaviour, since a clean HBase cluster is going to experience a surge
>>of traffic all going into one region server initially.
>>This is the motivation behind pre-defining the regions, so the initial
>>surge of traffic is distributed evenly.
>>
>>My strategy is to take the incoming data, calculate the hash and then
>>mod the hash with the number of machines I have. I will split the
>>regions according to the prefix #.
>>This should, I think, provide for better data distribution when the
>>cluster first starts up with one region / region server.
>>
>>These regions should then grow fairly uniformly. Once they reach a
>>size like ~ 5G, I can do a rolling split.
>>
>>Also, I want to make sure my regions do not grow so large that,
>>when I end up adding more machines, it takes a very long time
>>to perform a rolling split.
>>
>>What I do not understand is the advantages/disadvantages of having
>>regions that are too big vs regions that are too thin. What does this
>>impact? Compaction time? Split time? What is the
>>concern when it comes to how the architecture works? I think if I
>>understand this better, I can manage my regions more efficiently.
>>
>>
>>
>>On Mon, Oct 24, 2011 at 3:23 PM, Nicolas Spiegelberg
>><nspiegelberg@fb.com> wrote:
>>> Isn't a better strategy to create the HBase keys as
>>>
>>> Key = hash(MySQL_key) + MySQL_key
>>>
>>> That way you'll know your key distribution and can add new machines
>>> seamlessly.  I'm assuming that your rows don't overlap between any 2
>>> machines.  If so, you could append the MACHINE_ID to the key (not
>>> prepend).  I don't think you want the machine # as the first dimension
>>> on your rows, because you want the data from new machines to be evenly
>>> spread out across the existing regions.
>>>
>>>
>>> On 10/24/11 9:07 AM, "Stack" <stack@duboce.net> wrote:
>>>
>>>>On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal <selekt86@yahoo.com> wrote:
>>>>> According to the HBase book, pre splitting tables and doing manual
>>>>> splits is a better long term strategy than letting HBase handle it.
>>>>>
>>>>
>>>>It's good for getting a table off the ground, yes.
>>>>
>>>>
>>>>> Since I do not know what the keys from the prod system are going to
>>>>> look like, I am adding a machine number prefix to the row keys
>>>>> and pre splitting the tables based on the prefix (prefix 0 goes to
>>>>> machine A, prefix 1 goes to machine B, etc.).
>>>>>
>>>>
>>>>You don't need to do an in-order scan of the data?  What does the rest
>>>>of your row key look like?
>>>>
>>>>
>>>>> Once I decide to add more machines, I can always do a rolling split
>>>>> and add more prefixes.
>>>>>
>>>>
>>>>Yes.
>>>>
>>>>> Is this a good strategy for pre splitting the tables?
>>>>>
>>>>
>>>>So, you'll start out with one region per server?
>>>>
>>>>What do you think the rate of splitting will be like?  Are you using
>>>>default region size or have you bumped this up?
>>>>
>>>>St.Ack
>>>
>>>
>
>
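
To make the two key schemes discussed in this thread concrete, here is a minimal sketch in Python. It is not HBase client code and not anyone's actual implementation. It assumes MD5 as the hash, an 8-character hex prefix, and made-up key names like "order-42"; all function names are hypothetical. `hashed_row_key` follows the suggested Key = hash(MySQL_key) + MySQL_key layout, `split_points` computes evenly spaced boundaries over the hash prefix space that could be used for pre-splitting, and `mod_prefixed_row_key` shows the "hash mod number of machines" prefix for comparison.

```python
import hashlib

# Assumption: 8 hex characters (32 bits) of hash prefix is enough to spread keys.
PREFIX_HEX_CHARS = 8


def hashed_row_key(source_key):
    """Row key = hash(source_key) prefix + source_key; independent of cluster size."""
    prefix = hashlib.md5(source_key.encode("utf-8")).hexdigest()[:PREFIX_HEX_CHARS]
    return prefix + source_key


def split_points(num_regions):
    """Evenly spaced boundaries over the hex prefix space, one fewer than the regions."""
    space = 16 ** PREFIX_HEX_CHARS
    return [
        format(i * space // num_regions, "0{}x".format(PREFIX_HEX_CHARS))
        for i in range(1, num_regions)
    ]


def mod_prefixed_row_key(source_key, num_machines):
    """Row key = (hash mod num_machines) prefix + source_key; depends on cluster size."""
    h = int(hashlib.md5(source_key.encode("utf-8")).hexdigest(), 16)
    return "{:02d}".format(h % num_machines) + source_key


def fraction_of_keys_changed(keys, old_machines, new_machines):
    """Share of keys whose mod-prefixed row key differs after resizing the cluster."""
    changed = sum(
        1
        for k in keys
        if mod_prefixed_row_key(k, old_machines) != mod_prefixed_row_key(k, new_machines)
    )
    return changed / len(keys)


if __name__ == "__main__":
    sample = ["order-{}".format(i) for i in range(1000)]  # made-up source keys
    print(split_points(4))
    print(hashed_row_key(sample[0]))
    print(fraction_of_keys_changed(sample, 4, 5))
```

With the hash-prefixed layout, adding machines only means choosing more split points; the row keys themselves never change. With the mod-prefixed layout, going from 4 to 5 machines gives most keys a different prefix, which is the remapping problem raised above. Either way the hashed prefix gives up in-order scans over the original keys.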
