geronimo-dev mailing list archives

From Jules Gosnell <ju...@coredevelopers.net>
Subject Re: Web State Replication... (long)
Date Sat, 01 Nov 2003 13:55:42 GMT
Further thoughts:

In (2) you'll notice that each node ends up carrying ((n-1)*(b-1))+b
tranches, so the number of tranches will grow along with the number of
nodes in the cluster. This, you might think, will lead to scalability
issues - it would, except that, as n increases, the size of each
tranche decreases accordingly.
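
Just to put numbers on it, here is a throwaway sketch (the class and
method names are mine and don't exist anywhere yet):

// Rough illustration only: tranche count per node grows with n, but each
// tranche holds a smaller slice of state, so per-node volume stays bounded.
public class TrancheMath {

    // tranches held per node, using the formula above (t = n-1)
    static int tranchesPerNode(int n, int b) {
        return (n - 1) * (b - 1) + b;
    }

    // fraction of a single node's primary state carried by one tranche
    static double trancheFraction(int n) {
        return 1.0 / (n - 1);
    }

    public static void main(String[] args) {
        // n=4, b=3 as in (2): 9 tranches, each 1/3 of a node's state
        System.out.println(tranchesPerNode(4, 3) + " tranches, each "
                + trancheFraction(4) + " of a node's state");
    }
}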

In (3), where we are left with state that needs to be rebalanced after
the loss of a node, there is at least one further option which we have
not considered.

- since the Blue state is already nicely balanced across the cluster,
there is no immediate need to move it. If we adopt the strategy of
waiting for sessions to be pulled out of it and adopted elsewhere as
and when needed, and in the meantime another node joins the cluster,
we could simply call it 'Blue' and allow it to replace the node we
just lost with minimum fuss. If these tranches eventually became empty
of sessions due to timeout, passivation and migration, we could drop
them from the cluster anyway.
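
A very rough sketch of what that might look like - the names are
invented purely to illustrate the idea:

import java.util.ArrayDeque;
import java.util.Deque;

// Illustration only: keep a dead node's (already balanced) tranches where
// they are, let a newcomer adopt the dead node's identity, and drop any
// tranche that drains to empty via timeout/passivation/migration.
class LazyRebalanceSketch {

    private final Deque<String> orphanedIdentities = new ArrayDeque<>();

    void onNodeLeft(String identity) {
        // no immediate rebalancing - just remember the vacant identity
        orphanedIdentities.push(identity);
    }

    String onNodeJoined(String proposedIdentity) {
        // a joining node simply becomes 'Blue' (or whoever we lost), so
        // the existing tranche layout stays valid with minimum fuss
        return orphanedIdentities.isEmpty()
                ? proposedIdentity
                : orphanedIdentities.pop();
    }

    boolean canDropTranche(int liveSessions) {
        // once empty, an orphaned tranche can be dropped from the cluster
        return liveSessions == 0;
    }
}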



Jules



Jules Gosnell wrote:

>
> OK,
>
> Here is the latest and greatest.
>
> I have introduced a new parameter - 't' - the number of 'tranches'
> that each node's state is cut into so that it can be replicated across
> a collection of different nodes instead of just going to one single
> one. This is in response to feedback on this thread.
>
> Initially, I didn't see much benefit in this extra complexity, but now
> I am coming round to it :-)
>
> If you want to fully understand the contents of this posting, you will
> need to have digested the previous posting I made with a similar
> diagram.
>
> I'll walk through the same scenario that I presented earlier with t=1,
> except now I will go to the opposite end of the spectrum and make
> t=(n-1) - i.e. each node splits its state into the same number of
> tranches as there are nodes in the cluster (excepting itself) and
> stores one tranche with each. This is further complicated by the
> parameter 'b' (number of buddies in a partition), which now
> effectively becomes the number of copies of each tranche present in
> the cluster.
>
> (1)
>
> n=3
> b=3
> t=(n-1)=2
>
> Red splits its primary state into (n-1) tranches and, starting with
> the first tranche, replicates a copy of it to the following 'b'-1
> nodes. Then it takes the next tranche, starts one node further out and
> does the same thing. It excludes itself from this process and simply
> wraps around the clock if it runs out of nodes.
>
> Each node does exactly the same thing, resulting in state being
> equally balanced around the cluster.
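>
> In code, the clock-wise placement might look something like this
> (only a sketch - the class and method names are invented for
> illustration):
>
> import java.util.ArrayList;
> import java.util.List;
>
> // Sketch of the placement rule above: tranche k of the node at position
> // 'owner' is copied to the b-1 nodes that follow it around the clock,
> // starting one node further out for each successive tranche, skipping
> // the owner itself and wrapping as needed (assumes b <= n).
> class PlacementSketch {
>
>     /** Positions of the nodes holding backup copies of tranche k. */
>     static List<Integer> backupsFor(int owner, int k, int n, int b) {
>         List<Integer> backups = new ArrayList<Integer>();
>         for (int j = 0; j < b - 1; j++) {
>             int offset = ((k + j) % (n - 1)) + 1;  // 1..n-1, never 0 (self)
>             backups.add((owner + offset) % n);     // wrap around the clock
>         }
>         return backups;
>     }
> }
>
> With n=3 and b=3 this puts a copy of each of Red's two tranches on
> both of the other nodes, which is the layout in (1).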
>
>
> (2)
>
> n=4
> b=3
> t=(n-1)=3
>
> Blue joins the cluster.
>
> Everyone increases their number of primary tranches by one and
> allocates (b-1) backup tranches for Blue.
>
> The cluster reorganises the replication relationships between
> nodes:tranches in accordance with the algorithm specified above. (I
> have to work out the nitty gritty of how efficiently I can do this).
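>
> In terms of the placement sketch above, Blue's arrival just means
> recomputing the table with n=4 and diffing it against the old n=3
> layout to see which tranches actually need to move - e.g.:
>
> // n=4, b=3, t=3: print the new backup positions for every tranche
> for (int owner = 0; owner < 4; owner++) {
>     for (int k = 0; k < 3; k++) {
>         System.out.println("node " + owner + ", tranche " + k
>                 + " -> " + PlacementSketch.backupsFor(owner, k, 4, 3));
>     }
> }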
>
>
> (3)
>
> n=3
> b=3
> t=(n-1)=2
>
> Blue leaves the cluster.
>
> The diagram shows the state immediately after Blue has left - before a
> rebalancing of state.
>
> I suggest that Red, Green and Yellow tranches rearrange themselves as
> efficiently as possible back to the same layout as in (1). We are then
> left with every node carrying an extra b-1 Blue tranches.
>
> The sessions contained in these can either be proactively merged into
> other nearby primary tranches and replicated according to their
> partitioning, or lazily pulled onto the next node to receive a request
> that requires them and assimilated at that point, or perhaps pushed
> straight out to shared store, to be lazily loaded and adopted, as
> required, by whichever node first needs them.
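>
> Just to pin those three options down, something like this (names
> invented purely for discussion):
>
> // Possible fates for an orphaned (ex-Blue) tranche:
> enum OrphanedTranchePolicy {
>     PROACTIVE_MERGE,   // merge its sessions into nearby primary tranches
>                        // now and re-replicate them under the new layout
>     LAZY_MIGRATE,      // leave it; the next node to receive a request for
>                        // a session pulls that session over and adopts it
>     EVACUATE_TO_STORE  // push it to shared store; whichever node first
>                        // needs a session loads and adopts it from there
> }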
>
>
>
> I shall look at genericising my initial design to allow the
> parameterisation of 't' to a value between 1 and n-1.
>
> 't' will basically control whether the joining/leaving of a node
> wreaks a large amount of havoc on a small number of nodes or a small
> amount on a large number.
>
>
> That's it for now,
>
>
> Jules
>
>
>
> Jules Gosnell wrote:
>
>> Hmmm...
>>
>> I've given the more-than-one-replication-bucket-per-node idea a
>> little more thought ...
>>
>> I'm not sure that the extra complexity will merit the perceived gains
>> in terms of balancing the load associated with cluster
>> growing/shrinking across the whole cluster, instead of just between
>> the nodes immediately surrounding the point of node entry/exit.
>> However, this is an area that we should consider more closely.
>> Perhaps we could even generalise the algorithm to allow the
>> configuration of replication-buckets-per-node....
>>
>> I'll keep on it.
>>
>> Jules
>>
>>
>>
>> Jules Gosnell wrote:
>>
>>> Guys,
>>>
>>> I understand exactly what you are both saying and you can relax - at 
>>> migration time, I am working at the session level - that is one 
>>> bucket=one session - so if you have 10 sessions and you want to 
>>> leave a cluster of 11 nodes, provided that your load-balancer can 
>>> handle it, you can migrate 1 session to each node.
>>>
>>> However, at replication time 1 bucket=the whole state of the node -
>>> i.e. replication groups are organised at the node level - not at the
>>> single session level. Having each session remember where each of
>>> its copies is held is just too much overhead and, as I pointed out
>>> in my last mail, I can't see any advantage in terms of resilience in
>>> every node holding (n*1/n)*b*s or 1*b*s sessions, or some division
>>> in between - the point is it will always add up to b*s, which is the
>>> number of sessions and backups that will need to be rehomed if you
>>> lose a node. It is the granularity at which the rehoming takes place
>>> that is important, and as I have shown, this is the most granular it
>>> can be.
>>>
>>> Of course, there is no reason why all migration should be done at 
>>> the single session level - load-balancer allowing - a node could put 
>>> in a bid for several thousand sessions and have them all batched and 
>>> migrated across in a single transaction.
>>>
>>> We are describing pretty much the same thing in different terms.
>>>
>>> Happier :-)  ?
>>>
>>>
>>> Jules
>>>
>>>
>>> Dain Sundstrom wrote:
>>>
>>>> Jules,
>>>>
>>>> IIRC James' point is that having a lot more buckets than nodes
>>>> makes adding and reorganizing state much easier.  Of course in the
>>>> case of a failure you still have a bulk transfer of data, but the
>>>> bulk transfer is spread across the cluster.  This helps avoid a
>>>> dominoes-style cascade where the first node dies, then its backup
>>>> dies from the bulk transfer load, and then that backup dies, and
>>>> so on.
>>>>
>>>> Anyway, I think the big benefit is the ease of redistributing 
>>>> sessions.  Instead of a new node saying I'll take these 3k 
>>>> sessions, it says I'll take these three buckets.  The load is much 
>>>> less but I think the biggest benefit is the code should be easier 
>>>> to debug, understand and write.
>>>>
>>>> It is not important now.  As long as we keep the interface simple 
>>>> and clean we can try many implementations until something fits.
>>>>
>>>> -dain
>>>>
>>>> On Thursday, October 30, 2003, at 11:43 AM, Jules Gosnell wrote:
>>>>
>>>>> James Strachan wrote:
>>>>>
>>>>>> On Thursday, October 30, 2003, at 12:19  pm, gianny DAMOUR wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Just a couple of questions regarding this design:
>>>>>>>
>>>>>>> - Is it possible to configure the weight of a node? If yes, is
>>>>>>> the same auto-partitioning policy applicable? My concern is that
>>>>>>> a "clockwise" policy may add a significant load on nodes hosted
>>>>>>> by low spec hosts.
>>>>>>
>>>>>> This is partly a problem for the sticky load balancer to deal
>>>>>> with, i.e. it should route requests to primary machines based on
>>>>>> spec/power.
>>>>>>
>>>>>> If we partitioned the session data into buckets (rather than one
>>>>>> big lump), then the buckets of session data can be distributed 
>>>>>> evenly around the cluster so that each session bucket has N 
>>>>>> buddies (replicas) but that a load-balancing algorithm could be 
>>>>>> used to distribute the buckets based on (say) a host spec 
>>>>>> weighting or whatnot - e.g. nodes in the cluster could limit how 
>>>>>> many buckets to accept due to their lack of resources etc.
>>>>>>
>>>>>> Imagine having 1 massive box and 2 small ones in a cluster - 
>>>>>> you'd probably want to give the big box more buckets than the 
>>>>>> smaller ones. The previous model Jules described still holds 
>>>>>> (that was a view of 1 session bucket) - it's just that the total 
>>>>>> session state for a machine might be spread over many buckets.
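>>>>>>
>>>>>> A naive weight-based allocation might be as simple as this
>>>>>> (illustration only - nothing like it exists yet):
>>>>>>
>>>>>> import java.util.LinkedHashMap;
>>>>>> import java.util.Map;
>>>>>>
>>>>>> // Give each node a share of the buckets proportional to its weight;
>>>>>> // the big box ends up with more buckets than the small ones.
>>>>>> class WeightedBuckets {
>>>>>>     static Map<String, Integer> allocate(Map<String, Integer> weights,
>>>>>>                                          int totalBuckets) {
>>>>>>         int totalWeight = 0;
>>>>>>         for (int w : weights.values()) totalWeight += w;
>>>>>>         Map<String, Integer> shares = new LinkedHashMap<String, Integer>();
>>>>>>         int assigned = 0;
>>>>>>         String biggest = null;
>>>>>>         for (Map.Entry<String, Integer> e : weights.entrySet()) {
>>>>>>             int share = totalBuckets * e.getValue() / totalWeight;
>>>>>>             shares.put(e.getKey(), share);
>>>>>>             assigned += share;
>>>>>>             if (biggest == null || e.getValue() > weights.get(biggest))
>>>>>>                 biggest = e.getKey();
>>>>>>         }
>>>>>>         // any rounding remainder goes to the biggest box
>>>>>>         shares.put(biggest, shares.get(biggest) + totalBuckets - assigned);
>>>>>>         return shares;
>>>>>>     }
>>>>>> }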
>>>>>>
>>>>>> Having multiple buckets could also help spread the load of 
>>>>>> recovering from a node failure in larger clusters.
>>>>>
>>>>> James, I have given this quite a bit of thought... and whilst it 
>>>>> was initially appealing and seemed a sensible extension of my 
>>>>> train of thought, I have not been able to find any advantage in 
>>>>> splitting one node's state into multiple buckets....
>>>>>
>>>>> If a node joins or leaves, you still have exactly the same amount 
>>>>> of state to shift around the cluster.
>>>>>
>>>>> If you back up your sessions off-node, then whether these are all
>>>>> on one backup node, or spread over 10, makes no difference, since
>>>>> in the first case if you lose the backup node you have to shift
>>>>> 100% x 1 node's state. In the second case you have to shift 10% x
>>>>> 10 nodes' state (since each backup node will be carrying 10% of
>>>>> the state of another 9 nodes as well as your own). Initially it
>>>>> looks more resilient but...
>>>>>
>>>>> So I am sticking, by virtue of Occam's razor, to the simpler 
>>>>> approach for the moment, until someone can draw attention to a 
>>>>> situation where the extra complexity of a higher granularity 
>>>>> replication strategy is worth the gain.
>>>>>
>>>>>
>>>>> Thinking about it, my current design is probably hybrid - since 
>>>>> whilst a node's state is all held in a single bucket, individual 
>>>>> sessions may be migrated out of that bucket and into another one 
>>>>> on another node. So it is replication granularity that is set to 
>>>>> node-level, but migration granularity is at session level. I guess 
>>>>> you are suggesting that a bucket is somewhere between the two of 
>>>>> these and is the level at which both are replicated and migrated? 
>>>>> I'll give it some more thought :-)
>>>>>
>>>>>
>>>>> Jules
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> - I have the feeling that one can not configure a preferred
>>>>>>> replication group for primary sessions of a specific node: if
>>>>>>> four nodes are available, I would like to configure that
>>>>>>> sessions of the first node should be replicated by the third
>>>>>>> node, if available, or the fourth one.
>>>>>>>
>>>>>>> - Is it not an overhead to have b-1 replicas? AFAIK, a single
>>>>>>> secondary should be enough.
>>>>>>
>>>>>> It all depends on your risk profile I suppose. 1 backup is 
>>>>>> usually enough but you may want 2 for extra resilience - 
>>>>>> especially as one of those could be in a separate DR zone for 
>>>>>> really serious fail-over scenarios.
>>>>>>
>>>>>> James
>>>>>> -------
>>>>>> http://radio.weblogs.com/0112098/
>>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> /*************************************
>>>>> * Jules Gosnell
>>>>> * Partner
>>>>> * Core Developers Network (Europe)
>>>>> * http://www.coredevelopers.net
>>>>> *************************************/
>>>>>
>>>>>
>>>>>
>>>>
>>>> /*************************
>>>>  * Dain Sundstrom
>>>>  * Partner
>>>>  * Core Developers Network
>>>>  *************************/
>>>>
>>>
>>>
>>
>>
>
>
>
> ------------------------------------------------------------------------
>


-- 
/*************************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * http://www.coredevelopers.net
 *************************************/


