geronimo-dev mailing list archives

From Jules Gosnell <ju...@coredevelopers.net>
Subject Re: [clustering] automatic partitioning, state bucketing, etc... (long)
Date Thu, 21 Aug 2003 12:52:47 GMT
inline... just one paragraph...

Jules Gosnell wrote:

> Jeremy Boynes wrote:
>
>>> I'm going to pick up this thread again :-)
>>>   
>>
>> We just can't leave it alone :-)
>>
>>  
>>
>>> we have to deal with both dumb and integrated load-balancers...
>>>
>>> DUMB LB:
>>>
>>> (A) undivided cluster, simple approach:
>>>
>>> every node buddies for every other node
>>> no 'locate/migrate' required since every session is on every node
>>> replication needs to be synchronous, in order to guarantee that the
>>> node on which the next request falls will be up-to-date
>>>
>>> problem: unscalable
>>>
>>> (B) subdivided cluster, more complex approach:
>>>
>>> cluster subdivided into buddy teams (possibly only of pairs).
>>> 'locate/migrate' required since request may fall on node that does not
>>> have session to hand
>>> primary could use asyn and secondary sync replication, provided that
>>> 'locate' always talked to primary
>>>   
>>
>> sync and async are both options - sync may be needed for dumb clients 
>> (http
>> 1.0 or ones which overlap requests e.g. for frames)
>>
>>  
>>
>>> problem: given a cluster of n nodes divided into teams of t nodes, only
>>> t/n of requests will avoid the 'locate/migrate' step (e.g. with n=20 and
>>> t=2, only 10% of requests land on a node that already holds the
>>> session) - in a large cluster with small teams, this is not much more
>>> efficient than a shared store solution.
>>>
>>>   
>>
>> conclusion: DUMB LB is a bad choice in conjunction with a replication 
>> model
>> and shared state. Only recommended for use with stateless front ends.
>>
> let me qualify that - it's OK if your cluster is small enough that it 
> effectively has only one buddy-group, i.e. all nodes carry all the 
> cluster's state. Then you use sync replication and it works - scaling 
> up will require a smarter LB.
>
>>
>>
>>  
>>
>>> SMART LB (we're assuming it can do pretty much whatever we want it to).
>>>   
>>
>> We're assuming it is smart in that:
>> 1) it can apply session affinity (including SSL if required)
>> 2) it can detect failed nodes
>>
> I'm assuming that even the dumb one does this (i.e. it's an intrusive 
> proxy rather than e.g. non-intrusive round-robin DNS)
>
>> 3) it has (possibly configurable) policies for failing over if a node 
>> dies
>>
> hmm...
>
>>
>> I think that's all the capabilities it needs.
>>
>>  
>>
>>> (A)
>>>
>>> assuming affinity, we can use async replication, because the request
>>> will always fall on the most up-to-date node.
>>>   
>>
>> async is a reliability/performance tradeoff - it introduces a window in
>> which modified state has not been replicated and may be lost. Again 
>> sync vs.
>> async should be configurable.
>>
> exactly, and if you have session affinity you might weigh the odds 
> of a failure compounded by a consecutive request racing and beating an 
> async update, and decide to trade them for the runtime benefit of 
> async over sync replication.
>
>>
>>  
>>
>>> if this node fails, the lb MUST pick one node to fail over to and
>>> continue to use that one (or else we have to fall back to sync and
>>> assume a dumb lb). if the original node comes back up, it doesn't matter
>>> whether the lb goes back to it, or remains stuck to the fail-over node.
>>>   
>>
>> The premise is that the LB will pick a new node and stick to it. How it 
>> picks the new node is undefined (depends on how the LB works) and may 
>> result in a locate/migrate step if it picks a node without state.
>>
>> If the old node comes back it will be cold and will trigger a
>> locate/migrate step if the LB picks it (e.g. if it has a preferential
>> affinity model).
>>  
>>
> OK - this has been the root of some confusion. You have been assuming 
> that affinity means 'stick-to-last-visited-node' whereas I am using 
> the mod_jk reading, which is something like 
> 'stick-to-a-particular-node-and-if-you-lose-it-you're-screwed'...
>
>>  
>>
>>> (B)
>>>
>>> if we can arrange for the LB to use affinity, with failover limited to
>>> our buddy-group, and always stick to the failover node as well, we can
>>> lose 'locate/migrate' and replicate asynchronously. If we can't get
>>> 'always stick to the failover node', we replicate synchronously after
>>> failover.
>>>   
>>
>> Again async only works if you are willing to lose state.
>>
> you only lose state, as above, in the event of a failure compounded by 
> a session losing a race with a request - unless you mean that you lose 
> state because it didn't get off-node before the node went down - in 
> which case, agreed.
>
> when using JavaGroups, if you multicast your RMI, you have the choice 
> of a number of modes including GET_ALL and GET_NONE. GET_ALL means 
> that you wait for a reply from all nodes RMI-ed to (sync); GET_NONE 
> means that you don't wait for any (async). I may be wrong, but I don't 
> believe it takes any longer for your data to get off-node - you just 
> don't keep the client hanging around whilst you wait for all your 
> buddies to confirm their new state...
>
> If you take this into account, JG-async is an attractive alternative, 
> since it further compounds the unlikeliness of state loss to: 
> node-failure compounded with transport-failure compounded with a lost 
> state-xfer/consecutive-request race...
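>
> to make the two modes concrete, here is a minimal sketch against the 
> JavaGroups RpcDispatcher API (constructor, constants and the 
> callRemoteMethods signature are from memory, so treat them as an 
> assumption; the 'sessionService' object and its 'setAttribute' method 
> are hypothetical):
>
>     import org.jgroups.JChannel;
>     import org.jgroups.blocks.GroupRequest;
>     import org.jgroups.blocks.RpcDispatcher;
>
>     public class Replicator {
>         private final RpcDispatcher dispatcher;
>
>         public Replicator(JChannel channel, Object sessionService) {
>             // sessionService exposes the methods the buddies invoke
>             // remotely, e.g. setAttribute(String id, String key, Object val)
>             dispatcher = new RpcDispatcher(channel, null, null, sessionService);
>         }
>
>         public void replicate(String id, String key, Object val, boolean sync) {
>             int mode = sync ? GroupRequest.GET_ALL   // block until every buddy acks
>                             : GroupRequest.GET_NONE; // fire-and-forget
>             dispatcher.callRemoteMethods(
>                 null,                               // null dests = the whole group
>                 "setAttribute",
>                 new Object[] { id, key, val },
>                 new Class[] { String.class, String.class, Object.class },
>                 mode,
>                 5000);                              // reply timeout (ms), only matters for GET_ALL
>         }
>     }
>
> the only thing the flag changes is whether the caller waits for the 
> acks - the multicast leaves the node immediately in both modes, which 
> is exactly the point above.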
>
>>
>>
>>  
>>
>>> if we can only arrange affinity, but not failover within the group, we
>>> can replicate asynchronously and need 'locate/migrate'. If we can't have
>>> lb-remains-stuck-to-failover-node, we are in trouble, because as soon as
>>> the primary node fails we go back to the situation outlined above where
>>> we do a lot of locate/migrate and are not much better off than a shared
>>> store.
>>>
>>>   
>>
>> I don't get you on this one - maybe we have a different definition of
>> affinity: mine is that a request will always be directed back to the 
>> node
>> that served the last one unless that node becomes unavailable. This 
>> means
>> that a request goes to the last node that served it, not the one that
>> originally created the session.
>>
> agreed - our usage differs, as discussed above.
>
>>
>> Even if you have 'affinity to the node that created the session' then 
>> you
>> don't get a lot of locate/migrate - just a burst when the node comes 
>> back
>> online.
>>  
>>
> but you do. If affinity is only to the creating node and you lose it, 
> you are then back to a stage where only num-buddies/num-nodes of 
> requests will hit the session-bearing node... until you either bring 
> the node back up or (this is where the idea of session bucketing is 
> useful) call up the lb and remap the session's routing info (usually 
> e.g. the node name, but perhaps the bucket name) to another host:port 
> combination.
>
> the burst will go on until the replacement node is up or the remapping 
> is done and all sessions have been located/migrated back to their 
> original node.
>
>>  
>>
>>> The lb-sticks-to-failover-node is not as simple as it sounds - mod_jk
>>> doesn't do it.
>>>   
>>
>> :-(
>>
>>  
>>
>>> it implies
>>>
>>> either :
>>>
>>> you have the ability to change the routing info carried on the session
>>> id client side (I've considered this and don't think it practical - I
>>> may be wrong ...)
>>>   
>>
>> I'm dubious about this too - it feels wrong but I can't see what it 
>> breaks.
>>
>> I'm assuming that you set JSESSIONID to id.node with node always 
>> being the
>> last node that serves it. The LB tries to direct the request to node, 
>> but if
>> it is unavailable picks another from its configuration. If the new 
>> node does
>> not have state then you do locate/migrate.
>>
> not quite.
>
> you set the session id ONCE, to e.g. (with mod_jk) <session-id>.<node-id>
>
> resetting it is really problematic.
>
> mod_jk maps (in its workers.properties) this node-id to a host:port 
> combination.
>
> if the host:port is unavailable it chooses another node at random - it 
> neither tries to reset the cookie/url-param with the client nor 
> remembers the chosen node for subsequent requests for the same 
> session...
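>
> to ground that, an illustrative workers.properties fragment (worker 
> and host names invented) - the <node-id> suffix on the session id has 
> to match the worker name for the sticky routing to work:
>
>     worker.list=loadbalancer
>
>     worker.nodeA.type=ajp13
>     worker.nodeA.host=hostA
>     worker.nodeA.port=8009
>
>     worker.nodeB.type=ajp13
>     worker.nodeB.host=hostB
>     worker.nodeB.port=8009
>
>     worker.loadbalancer.type=lb
>     worker.loadbalancer.balanced_workers=nodeA,nodeB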
>
>>
>>  
>>
>>> or :
>>>
>>> the session id needs to carry not just a single piece of routing info
>>> (like a mod_jk worker name) but a failover list - worker1,worker2,worker3
>>> etc - in effect your buddy-team,
>>>
>>>   
>>
>> Again, requires that the client handles changes to JSESSIONID OK. This
>> allows the nodes to determine the buddy group and would reduce the 
>> chance of
>> locate/migrate being needed.
>>
>
> AHA! this is where I introduce session buckets.
>
> tying the routing info to node ids is a bad idea - because nodes die. 
> I think we should suffix the session ids with a bucket-id. Sessions 
> cannot move out of a bucket, but a bucket can move to another node. If 
> this happens, you call up the lb and remap the bucket-id to a new 
> host:port. You can do this for mod_jk by rewriting its 
> workers.properties and doing an 'apachectl graceful' - nasty but 
> solid. mod_jk2 actually has an HTTP-based API that could probably do 
> this. Big-IP has a SOAP API which will also probably let you do this, 
> etc...
>
> The extra level of indirection introduced (session -> node becomes 
> session -> bucket -> node) allows you this flexibility, as well as the 
> possibility of mapping more than one bucket to each node. Doing this 
> will help in redistributing state around a cluster when buddy-groups 
> are created/destroyed....
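>
> in the same workers.properties terms, a sketch of the remap (the 
> bucket worker names are hypothetical - this illustrates the 
> indirection rather than a tested config): each bucket gets its own 
> worker entry, and moving a bucket means rewriting the host behind it 
> and doing the graceful restart:
>
>     # before: sessions stamped <session-id>.bucket1 go to hostA
>     worker.bucket1.type=ajp13
>     worker.bucket1.host=hostA
>     worker.bucket1.port=8009
>
>     # after hostA dies: rewrite, then 'apachectl graceful'
>     worker.bucket1.host=hostB
>
> the client's cookie never changes - only the lb's idea of where 
> bucket1 lives.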
>
>>
>>  
>>
>>> or:
>>>
>>> the lb needs to maintain state, remembering where each session was last
>>> serviced and always sticking requests for that session to that node. In
>>> a large deployment this requires lbs to replicate this state between
>>> them so that they can balance over the same nodes in a coordinated
>>> fashion. I think F5 Big-IP is capable of this, but effectively you just
>>> shift the state problem from yourself to someone else.
>>>   
>>
>> Not quite - the LBs are sharing session-to-node affinity data, which is 
>> very small; the buddies are sharing session state, which is much 
>> larger. You are sharing the task, not shifting it. Yes, the LBs can do 
>> this.
>>
> agreed - but now you have two pieces of infrastructure replicating 
> instead of one - I don't see this as ideal. I would rather our 
> solution worked for simpler lbs that don't support clever stuff like 
> this...
>
>>
>>  
>>
>>> Note that if your lb can understand extended routing info involving the
>>> whole buddy team, then you know that it will always balance requests to
>>> members of this team anyway, in which case you can dispense with
>>> 'locate/migrate' again.
>>>   
>>
>> It would be useful if the LB did this but I don't think it's a 
>> requirement.
>>
>> I don't think you can dispense with locate unless you are willing to 
>> lose
>> sessions.
>>
>> For example, if the buddy group is originally (nodeA, nodeB) and both 
>> those
>> nodes get cycled out, then the LB will not be able to find a node 
>> even if
>> the cluster migrates the data to (nodeC, nodeD). When it sees the request
>> come in and knows that A and B are unavailable, it will pick a random
>> node, say nodeX, and X needs to be able to locate/migrate from C or D.
>>
> hmmm... I can't store buddy-group and bucket-info in the session id - 
> it has to be one or the other - I'll think on this some more and come 
> back - maybe tonight... 


let's say you have nodes A,B,C,D...

your session cookie contains xxx.1,2,3 where xxx is the session id and 
1,2,3 is the 'routing-info' - a failover list (I think BEA works like 
this...)

e.g. mod_jk will have:

1:A
2:B
3:C

A,B,C are buddies

A dies :-(

your lb will route requests carrying A,B,C to B in the absence of A (I 
guess B will just have to put up with double the load - is this a 
problem? probably yes :-( - though that's not specific to this 
solution; any form of affinity has it)

your buddy group is now unbalanced

the algorithm I am proposing (let's call it 'clock', because buddies are 
arranged e.g. 1+2,2+3,3+4,...11+12,12+1 etc) would close the hole in 
the clock by shifting a couple of buddies from group to group... state 
would need to shift around too.

whilst requests are still landing on B (note that there has been no need 
for locate/migrate), we can call up the lb and, hopefully, remap to:

1:B
2:C
3:D

all that we have to do now is make sure that D is brought up to date 
with B and C by a bulk state transfer....

This approach avoids any locate/migrate...
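
To make the 'clock' shuffle concrete, here is a minimal Java sketch of 
the ring arithmetic only (class and method names are mine, it uses 
buddy-pairs as in the 1+2,2+3 example, and the bulk state transfer is 
left out):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // 'clock' arrangement: the node at position i buddies with the node
    // at position i+1 (mod ring size), i.e. groups 1+2, 2+3, ... 12+1
    public class Clock {
        private final List<String> ring;

        public Clock(String... nodes) {
            ring = new ArrayList<String>(Arrays.asList(nodes));
        }

        // a node's buddy group: itself plus its clockwise neighbour
        public List<String> buddies(String node) {
            int i = ring.indexOf(node);
            return Arrays.asList(node, ring.get((i + 1) % ring.size()));
        }

        // node death closes the hole: recomputing buddies() after the
        // removal shifts the groups round the ring, so only the dead
        // node's neighbours need a bulk state transfer
        public void died(String node) {
            ring.remove(node);
        }
    }

With nodes A,B,C,D the groups are A+B, B+C, C+D, D+A; after died("A") 
they become B+C, C+D, D+B - the 1:B, 2:C, 3:D remap above is the 
lb-side counterpart of the same shuffle.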

I have definitely seen the routing list approach used somewhere and it 
probably would not be difficult to extend mod_jk[2] to do it. I will 
investigate whether Big-IP and Cisco's products might be able to do it...

If this functionality were commonly available in lbs, do you not think 
that this might be an effective way of cutting down further on the 
inter-node traffic in our cluster?

Jules




>
>
>>
>> This also saves the pre-emptive transfer of state from C back to A 
>> when A
>> rejoins - it only happens if nodeA gets selected.
>>
>>  
>>
>>> Finally - you still need a migrate operation as sessions will need to
>>> migrate from buddy-group to buddy-group as buddy-groups are created and
>>> destroyed...
>>>
>>>
>>> in summary - I think that you can optimise away 'locate' and a lot of
>>> 'migrate'-ion - Jetty's current impl has no locate and you can build
>>> subdivided clusters with it and mod_jk.... but I don't do automatic
>>> repartitioning yet....
>>>
>>>   
>>
>> IIRC Jetty's current impl does it by replicating to all nodes in the
>> partition and I thought that's what you were trying to reduce :-)
>>
> that is ONE way - but you can also, using mod_jk and a hierarchical lb 
> arrangement (where lb is the worker type, not the Apache process), 
> split up your cluster into buddy-groups. Affinity is then done to the 
> buddy-group, rather than the node. Sync replication within the group 
> gives you a scalable solution. Effectively, each buddy-group becomes 
> its own mini-cluster.
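>
> roughly, in workers.properties terms (untested - worker names are 
> invented, and I would want to check that mod_jk really accepts an lb 
> worker as a member of another lb worker):
>
>     worker.nodeA.type=ajp13
>     worker.nodeA.host=hostA
>     worker.nodeA.port=8009
>     # ...nodeB, nodeC, nodeD declared likewise...
>
>     worker.group1.type=lb
>     worker.group1.balanced_workers=nodeA,nodeB
>
>     worker.group2.type=lb
>     worker.group2.balanced_workers=nodeC,nodeD
>
>     worker.list=cluster
>     worker.cluster.type=lb
>     worker.cluster.balanced_workers=group1,group2
>
> each inner lb worker is one buddy-group/mini-cluster; the outer one 
> just spreads new sessions across the groups.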
>
>>
>> The basic tradeoff is wide-replication vs. locate-after-death - they 
>> both
>> work, I just think locate-after-death results in less overhead during 
>> normal
>> operation at the cost of more after a membership change, which seems
>> preferable.
>>  
>>
> this is correct - I am simply trying to demonstrate that, if we look 
> hard enough, there should be solutions where even after node death we 
> can avoid session locate/migrate.
>
> I would not feel easy performing maintenance on my cluster if I knew 
> that each time I start/stop a node there was likely to be a huge 
> flurry of activity as sessions are located/migrated everywhere - this 
> is bound to impact QoS, which is to be avoided...
>
>
> Jules
>
>>  
>>
>>> If you are still reading here, then you are doing well :-)
>>>   
>>
>> Or you're just one sick puppy :-)
>>
>> -- 
>> Jeremy
>>
>>  
>>
>
>


-- 
/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 **********************************/


