geronimo-dev mailing list archives

From Jules Gosnell <ju...@coredevelopers.net>
Subject Re: [clustering] automatic partitioning, state bucketing, etc... (long)
Date Thu, 21 Aug 2003 14:39:15 GMT
continuing from previous para...

There is one final issue which applies equally to this approach and to any
other failover situation in which you don't assume that the dead node
will be reincarnated (I think we should not rely on that)...

If A dies and all its requests fail over to B, B ends up servicing all
of its own requests plus all of A's.

This is because B itself was already the primary location for a number 
of sessions.

Three choices present themselves.

1. A hot standby for each node - i.e. a buddy that is not the primary
location for any sessions. This approach is hugely wasteful of
resources and not really in the spirit of what we are trying to do,
which is to have a homogeneous cluster without specialist nodes...

2. B just has to live with it, although we could adjust its weighting
within the lb to cut the number of requests coming to it down to just
those that already have a session with B as its primary location.
Eventually the number of sessions on B would die back down to a level
similar to the other nodes and we could put its lb weighting back up.

3. B offloads state onto other nodes and adjusts the lb's mapping
accordingly. This is why I've been considering multiple buckets per
node. If the 1,2,3 in the routing info were bucket ids, then the lb
would hold bucket-id:node mappings. B could migrate buckets of sessions
to other nodes and alter the bucket:node mappings in the lb. The
JSESSIONID value held in the browser does not have to change, but the
lb has to map that value to new physical locations.

mod_jk can do this, and I shall confirm whether Big-IP and Cisco's
products can do the same.
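
To make option 3 concrete, here is a minimal sketch (plain Java; the
class, the node names and the ports are all made up - this is not any
real lb's API) of the bucket-id:node table the lb would hold, and of B
offloading a bucket by remapping it:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the lb's routing table: bucket-id -> node (host:port).
    // Sessions never change bucket; buckets move between nodes.
    public class BucketTable {
        private final Map routes = new HashMap(); // bucketId -> "host:port"

        public synchronized void map(String bucketId, String hostPort) {
            routes.put(bucketId, hostPort);
        }

        public synchronized String lookup(String bucketId) {
            return (String) routes.get(bucketId);
        }

        public static void main(String[] args) {
            BucketTable lb = new BucketTable();
            lb.map("1", "nodeA:8009");
            lb.map("2", "nodeB:8009");
            lb.map("3", "nodeB:8009"); // B currently owns two buckets

            // B is overloaded after A's failure: offload bucket 3 to C.
            // The JSESSIONID suffix ("...3") in the browser never changes;
            // only the lb's bucket->node mapping does.
            lb.map("3", "nodeC:8009");
            System.out.println("bucket 3 now served by " + lb.lookup("3"));
        }
    }

The point being that the remap is a single, cheap table update on the
lb - no session data moves through the lb itself.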

In summary then, my locate/migrate-less approach (sorry to keep downing 
your locate/migrate thang, Jeremy :-) ) requires the following fn-ality 
from a smart lb...

1. an API to dynamically add/remove/remap a 'route' to a host:port
2. the ability to understand a 'routing-list' session-id suffix
3. an API to dynamically adjust the 'weighting' of a particular route
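
Just to pin down the shape of it, a hypothetical interface (not
mod_jk's, Big-IP's or anyone else's real API) covering those three
points might look like:

    // Hypothetical smart-lb management interface covering points 1-3 above.
    // Nothing here corresponds to a real product API; it is just the shape
    // of the functionality the cluster would need to call.
    public interface SmartLoadBalancer {
        // 1. dynamically add/remove/remap a 'route' to a host:port
        void mapRoute(String routeId, String host, int port);
        void removeRoute(String routeId);

        // 2. understand a 'routing-list' session-id suffix,
        //    e.g. "xxx.1,2,3" -> try route 1, then 2, then 3
        String selectNode(String sessionIdWithRoutingList);

        // 3. dynamically adjust the 'weighting' on a particular route
        void setWeight(String routeId, int weight);
    }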

This is pretty simple fn-ality, but with it I think we can do pretty
much everything that we need to.

Jules



Jules Gosnell wrote:

> inline... just one paragraph...
>
> Jules Gosnell wrote:
>
>> Jeremy Boynes wrote:
>>
>>>> I'm going to pick up this thread again :-)
>>>>   
>>>
>>>
>>> We just can't leave it alone :-)
>>>
>>>  
>>>
>>>> we have to deal with both dumb and integrated load-balancers...
>>>>
>>>> DUMB LB:
>>>>
>>>> (A) undivided cluster, simple approach:
>>>>
>>>> every node buddies for every other node
>>>> no 'locate/migrate' required since every session is on every node
>>>> replication needs to be synchronous, in order to guarantee that the
>>>> node on which the next request falls will be up-to-date
>>>>
>>>> problem: unscalable
>>>>
>>>> (B) subdivided cluster, more complex approach:
>>>>
>>>> cluster subdivided into buddy teams (possibly only of pairs).
>>>> 'locate/migrate' required since request may fall on node that does not
>>>> have session to hand
>>>> primary could use async and secondary sync replication, provided that
>>>> 'locate' always talked to the primary
>>>>   
>>>
>>>
>>> sync and async are both options - sync may be needed for dumb 
>>> clients (http
>>> 1.0 or ones which overlap requests e.g. for frames)
>>>
>>>  
>>>
>>>> problem: given a cluster of n nodes divided into teams of t nodes: 
>>>> only
>>>> t/n requests will be able to avoid the 'locate/migrate' step - in a
>>>> large cluster with small teams, this is not much more efficient than a
>>>> shared store solution.
>>>>
>>>>   
>>>
>>>
>>> conclusion: DUMB LB is a bad choice in conjunction with a 
>>> replication model
>>> and shared state. Only recommended for use with stateless front ends.
>>>
>> Let me qualify that - it's OK if your cluster is small enough that it
>> effectively has only one buddy-group, i.e. all nodes carry all of the
>> cluster's state. Then you use sync replication and it works - scaling
>> up will require a smarter LB.
>>
>>>
>>>
>>>  
>>>
>>>> SMART LB (we're assuming it can do pretty much whatever we want it 
>>>> to).
>>>>   
>>>
>>>
>>> We're assuming it is smart in that:
>>> 1) it can affinity sessions (including SSL if required)
>>> 2) it can detect failed nodes
>>>
>> I'm assuming that even the dumb one does this (i.e. it's an intrusive 
>> proxy rather than e.g. a non-intrusive round-robin-dns)
>>
>>> 3) it has (possibly configurable) policies for failing over if a 
>>> node dies
>>>
>> hmm...
>>
>>>
>>> I think that's all the capabilities it needs.
>>>
>>>  
>>>
>>>> (A)
>>>>
>>>> assuming affinity, we can use async replication, because request will
>>>> always fall on most up to date node.
>>>>   
>>>
>>>
>>> async is a reliability/performance tradeoff - it introduces a window in
>>> which modified state has not been replicated and may be lost. Again 
>>> sync vs.
>>> async should be configurable.
>>>
>> Exactly - and if you have session affinity you might weigh up the odds
>> of a failure compounded by a consecutive request racing and beating
>> an async update, and decide to trade them for the runtime benefit of
>> async vs sync replication.
>>
>>>
>>>  
>>>
>>>> if this node fails, the lb MUST pick one to failover to and 
>>>> continue to
>>>> use that one (or else we have to fall back to sync and assume dumb lb)
>>>> if original node comes back up, it doesn't matter whether lb goes back
>>>> to it, or remains stuck to fail-over node.
>>>>   
>>>
>>>
>>> The premise is that the LB will pick a new node and affinity to it. How it
>>> picks the new node is undefined (depends on how the LB works) and may
>>> result in a locate/migrate step if it picks a node without state.
>>>
>>> If the old node comes back it will have no state and will trigger a
>>> locate/migrate step if the LB picks it (e.g. if it has a preferential
>>> affinity model).
>>>  
>>>
>> OK - this has been the root of some confusion. You have been
>> assuming that affinity means 'stick-to-last-visited-node' whereas I
>> am using the mod_jk reading, which is something like
>> 'stick-to-a-particular-node-and-if-you-lose-it-you're-screwed'...
>>
>>>  
>>>
>>>> (B)
>>>>
>>>> if we can arrange for the LB to use affinity, with failover limited to our
>>>> buddy-group, and to always stick to the failover node as well, we can lose
>>>> 'locate/migrate' and replicate async. If we can't get 'always stick to the
>>>> failover node', we replicate sync after failover.
>>>>   
>>>
>>>
>>> Again async only works if you are willing to lose state.
>>>
>> you only lose state, as above, in the event of a failure compounded 
>> by a session losing a race with a request - unless you mean that you 
>> lose state because it didn't get off-node before the node went down - 
>> in which case, agreed.
>>
>> When using JavaGroups, if you multicast your RMI you have a choice of
>> a number of modes, including GET_ALL and GET_NONE. GET_ALL means that
>> you wait for a reply from all the nodes RMI-ed to (sync); GET_NONE
>> means that you don't wait for any (async). I may be wrong, but I don't
>> believe it takes any longer for your data to get off-node - you just
>> don't keep the client hanging around whilst you wait for all your
>> buddies to confirm their new state...
>>
>> If you take this into account, JG-async is an attractive alternative,
>> since state loss now requires node-failure compounded with
>> transport-failure compounded with a lost
>> state-xfer/consecutive-request race...
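
A rough sketch of what I mean, in plain Java (the Replicator interface
and method names are invented for illustration; this is not the
JavaGroups API itself, though GET_ALL/GET_NONE behave analogously at
the group-RPC level):

    // Hypothetical sketch of sync vs async replication of a session delta.
    // 'Replicator' is invented for illustration; in both modes the update is
    // sent to the buddies immediately - the difference is only whether the
    // request thread waits for their acknowledgements.
    public class ReplicationSketch {

        interface Replicator {
            // send the update to all buddies; return only when all have acked
            void sendAndWaitForAll(byte[] sessionDelta);
            // send the update to all buddies; return immediately
            void sendAndForget(byte[] sessionDelta);
        }

        static void onRequestEnd(Replicator r, byte[] delta, boolean sync) {
            if (sync) {
                r.sendAndWaitForAll(delta); // client waits; state known to be off-node
            } else {
                r.sendAndForget(delta);     // client returns sooner; small race window
            }
        }
    }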
>>
>>>
>>>
>>>  
>>>
>>>> if we can only arrange affinity, but not failover within the group, we can
>>>> replicate async and need 'locate/migrate'. If we can't have
>>>> lb-remains-stuck-to-failover-node, we are in trouble, because as soon as
>>>> the primary node fails we go back to the situation outlined above where we
>>>> do a lot of locate/migrate and are not much better off than a
>>>> shared store.
>>>>
>>>>   
>>>
>>>
>>> Don't get you on this one - maybe we have a different definition of
>>> affinity: mine is that a request will always be directed back to the 
>>> node
>>> that served the last one unless that node becomes unavailable. This 
>>> means
>>> that a request goes to the last node that served it, not the one that
>>> originally created the session.
>>>
>> agreed - our usage differs, as discussed above.
>>
>>>
>>> Even if you have 'affinity to the node that created the session' 
>>> then you
>>> don't get a lot of locate/migrate - just a burst when the node comes 
>>> back
>>> online.
>>>  
>>>
>> But you do. If affinity is only to the creating node and you lose it,
>> you are then back to a stage where only num-buddies/num-nodes of the
>> requests will hit a session-bearing node.... until you either bring
>> the node back up or (this is where the idea of session bucketing is
>> useful) call up the lb and remap the session's routing info (usually
>> e.g. the node name, but perhaps the bucket name) to another host:port
>> combination.
>>
>> The burst will go on until the replacement node is up or the
>> remapping is done and all sessions have been located/migrated back to
>> their original node.
>>
>>>  
>>>
>>>> The lb-sticks-to-failover-node is not as simple as it sounds - mod_jk
>>>> doesn't do it.
>>>>   
>>>
>>>
>>> :-(
>>>
>>>  
>>>
>>>> it implies
>>>>
>>>> either :
>>>>
>>>> you have the ability to change the routing info carried on the session
>>>> id client side (I've considered this and don't think it practical - I
>>>> may be wrong ...)
>>>>   
>>>
>>>
>>> I'm dubious about this too - it feels wrong but I can't see what it 
>>> breaks.
>>>
>>> I'm assuming that you set JSESSIONID to id.node with node always 
>>> being the
>>> last node that serves it. The LB tries to direct the request to 
>>> node, but if
>>> it is unavailable picks another from its configuration. If the new 
>>> node does
>>> not have state then you do locate/migrate.
>>>
>> not quite.
>>
>> you set the session id ONCE, to e.g. (with mod_jk) 
>> <session-id>.<node-id>
>>
>> resetting it is really problematic.
>>
>> mod_jk maps (in its workers.properties) this node-id to a host:port
>> combination.
>>
>> If the host:port is unavailable it chooses another node at random - it
>> neither tries to reset the cookie/url-param with the client nor
>> remembers the chosen node for subsequent requests for the same
>> session...
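
In other words, something like the following (a simplified model of the
behaviour described above, not mod_jk's actual code - the class and
worker names are made up):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    // Simplified model of mod_jk-style routing: JSESSIONID = "<session-id>.<node-id>",
    // node-id is mapped (workers.properties style) to a host:port; if that worker is
    // down, a surviving worker is picked at random and NOT remembered for next time.
    public class DumbAffinityRouter {
        private final Map workers = new HashMap();  // nodeId -> "host:port"
        private final List live = new ArrayList();  // nodeIds currently up
        private final Random random = new Random();

        public void addWorker(String nodeId, String hostPort) {
            workers.put(nodeId, hostPort);
            live.add(nodeId);
        }

        public void markDead(String nodeId) {
            live.remove(nodeId);
        }

        public String route(String jsessionid) {
            String nodeId = jsessionid.substring(jsessionid.lastIndexOf('.') + 1);
            if (live.contains(nodeId)) {
                return (String) workers.get(nodeId);       // normal sticky case
            }
            String fallback = (String) live.get(random.nextInt(live.size()));
            return (String) workers.get(fallback);         // random, unremembered failover
        }
    }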
>>
>>>
>>>  
>>>
>>>> or :
>>>>
>>>> the session id needs to carry not just a single piece of routing info
>>>> (like a mod_jk worker name) but a failover list 
>>>> worker1,worker2,worker3
>>>> etc in effect your buddy-team,
>>>>
>>>>   
>>>
>>>
>>> Again, requires that the client handles changes to JSESSIONID OK. This
>>> allows the nodes to determine the buddy group and would reduce the 
>>> chance of
>>> locate/migrate being needed.
>>>
>>
>> AHA ! this is where I introduce session buckets.
>>
>> Tying the routing info to node ids is a bad idea - because nodes die.
>> I think we should suffix the session ids with a bucket-id. Sessions
>> cannot move out of a bucket, but a bucket can move to another node.
>> If this happens, you call up the lb and remap the bucket-id:host-port.
>> You can do this for mod_jk by rewriting its workers.properties and
>> 'apachectl graceful' - nasty but solid. mod_jk2 actually has an
>> HTTP-based API that could probably do this. Big-IP has a SOAP API
>> which will also probably let you do this, etc...
>>
>> The extra level of indirection (session->node becomes
>> session->bucket->node) gives you this flexibility, as well as the
>> possibility of mapping more than one bucket to each node. Doing this
>> will help in redistributing state around the cluster when buddy-groups
>> are created/destroyed....
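
A sketch of that indirection (plain Java again, all names made up): a
session hashes to a bucket once, for life; only the bucket->node table
ever changes, and several buckets can map to one node:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of session -> bucket -> node indirection.
    // The bucket count is fixed; sessions never change bucket, buckets may change node.
    public class BucketIndirection {
        private final int bucketCount;
        private final Map bucketToNode = new HashMap(); // Integer -> "host:port"

        public BucketIndirection(int bucketCount) {
            this.bucketCount = bucketCount;
        }

        // decided once, at session creation; becomes the routing suffix on the id
        public int bucketFor(String sessionId) {
            return (sessionId.hashCode() & 0x7fffffff) % bucketCount;
        }

        // called when a bucket migrates, e.g. after a node death or a rebalance;
        // equivalent to rewriting workers.properties or calling the lb's API
        public void remap(int bucketId, String hostPort) {
            bucketToNode.put(new Integer(bucketId), hostPort);
        }

        public String nodeFor(String sessionId) {
            return (String) bucketToNode.get(new Integer(bucketFor(sessionId)));
        }
    }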
>>
>>>
>>>  
>>>
>>>> or:
>>>>
>>>> the lb needs to maintain state, remembering where each session was 
>>>> last
>>>> serviced and always sticking requests for that session to that 
>>>> node. in
>>>> a large deployment this requires lbs to replicate this state between
>>>> them so that they can balance over the same nodes in a coordinated
>>>> fashion. I think F5 Big-IP is capable of this, but effectively you 
>>>> just
>>>> shift the state problem from yourself to someone else.
>>>>   
>>>
>>>
>>> Not quite - the LB's are sharing session-to-node affinity data which 
>>> is very
>>> small; the buddies are sharing session state which is much larger. 
>>> You are
>>> sharing the task not shifting it. Yes, the LB's can do this.
>>>
>> agreed - but now you have two pieces of infrastructure replicating 
>> instead of one - I don't see this as ideal. I would rather our 
>> solution worked for simpler lbs that don't support clever stuff like 
>> this...
>>
>>>
>>>  
>>>
>>>> Note that if your lb can understand extended routing info involving 
>>>> the
>>>> whole buddy team, then you know that it will always balance 
>>>> requests to
>>>> members of this team anyway, in which case you can dispense with
>>>> 'locate/migrate' again.
>>>>   
>>>
>>>
>>> It would be useful if the LB did this but I don't think it's a 
>>> requirement.
>>>
>>> I don't think you can dispense with locate unless you are willing to 
>>> lose
>>> sessions.
>>>
>>> For example, if the buddy group is originally (nodeA, nodeB) and 
>>> both those
>>> nodes get cycled out, then the LB will not be able to find a node 
>>> even if
>>> the cluster migrates the data to (nodeC, nodeD). When it sees the request
>>> come in and knows that A and B are unavailable, it will pick a random node,
>>> say nodeX, and X needs to be able to locate/migrate from C or D.
>>>
>> hmmm... I can't store buddy-group and bucket-info in the session id - 
>> it has to be one or the other - I'll think on this some more and come 
>> back - maybe tonight... 
>
>
>
> let's say you have nodes A,B,C,D...
>
> your session cookie contains xxx.1,2,3 where xxx is the session id and 
> 1,2,3 is the 'routing-info' - a failover list (I think BEA works like 
> this...)
>
> e.g. mod_jk will have :
>
> 1:A
> 2:B
> 3:C
>
> A,B,C are buddies
>
> A dies :-(
>
> your lb will route requests carrying A,B,C to B in the absence of A
> (I guess B will just have to put up with double the load - is this a
> problem? probably yes :-( - and probably not just with this solution,
> but with any form of affinity)
>
> your buddy group is now unbalanced
>
> the algorithm I am proposing (let's call it 'clock' because buddies 
> are arranged e.g.  1+2,2+3,3+4,...11+12,12+1 etc) would close the hole 
> in the clock by shifting a couple of buddies from group to group... 
> state would need to shift around too.
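
A tiny sketch of the 'clock' arrangement (hypothetical, plain Java):
nodes form a ring, each node buddies with its clockwise neighbour, and
removing a dead node just closes the ring:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the 'clock' buddy arrangement: nodes form a ring,
    // each node is paired with its clockwise neighbour (1+2, 2+3, ... 12+1).
    // Removing a dead node closes the hole; only its neighbours' pairings change.
    public class ClockBuddies {
        public static List pairs(List nodes) {
            List result = new ArrayList();
            for (int i = 0; i < nodes.size(); i++) {
                String a = (String) nodes.get(i);
                String b = (String) nodes.get((i + 1) % nodes.size());
                result.add(a + "+" + b);
            }
            return result;
        }

        public static void main(String[] args) {
            List nodes = new ArrayList();
            nodes.add("A"); nodes.add("B"); nodes.add("C"); nodes.add("D");
            System.out.println(pairs(nodes)); // [A+B, B+C, C+D, D+A]
            nodes.remove("A");                // A dies - close the hole
            System.out.println(pairs(nodes)); // [B+C, C+D, D+B]
        }
    }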
>
> whilst requests are still landing on B (note that there has been no
> need for locate/migrate), we can call up the lb and, hopefully, remap to:
>
> 1:B
> 2:C
> 3:D
>
> all that we have to do now, is to make sure that D is brought up to 
> date with B and C by a bulk state transfer....
>
> This approach avoided doing any locate/migrate...
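
For what it's worth, the lb-side selection for such a routing list
would be trivial - something like this hypothetical sketch (not BEA's
or mod_jk's implementation): walk the list and take the first route
that maps to a live node:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical routing-list failover: cookie "xxx.1,2,3", table {1:A, 2:B, 3:C}.
    // With A dead, the request lands on B - no locate/migrate needed.
    public class RoutingListSelector {
        public static String select(String cookie, Map routeToNode, Set liveNodes) {
            String list = cookie.substring(cookie.indexOf('.') + 1); // "1,2,3"
            String[] routes = list.split(",");
            for (int i = 0; i < routes.length; i++) {
                String node = (String) routeToNode.get(routes[i]);
                if (node != null && liveNodes.contains(node)) {
                    return node;
                }
            }
            return null; // no buddy left - would need locate/migrate or an error
        }

        public static void main(String[] args) {
            Map table = new HashMap();
            table.put("1", "A"); table.put("2", "B"); table.put("3", "C");
            Set live = new HashSet();
            live.add("B"); live.add("C"); live.add("D");     // A is dead
            System.out.println(select("xxx.1,2,3", table, live)); // prints B
        }
    }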
>
> I have definitely seen the routing list approach used somewhere and it 
> probably would not be difficult to extend mod_jk[2] to do it. I will 
> investigate whether Big-IP and Cisco's products might be able to do it...
>
> If this fn-ality were commonly available in lbs, do you not think that 
> this might be an effective way of cutting down further on the 
> internode traffic in our cluster?
>
> Jules
>
>
>
>
>>
>>
>>>
>>> This also saves the pre-emptive transfer of state from C back to A 
>>> when A
>>> rejoins - it only happens if nodeA gets selected.
>>>
>>>  
>>>
>>>> Finally - you still need a migrate operation as sessions will need to
>>>> migrate from buddy-group to buddy-group as buddy-groups are created 
>>>> and
>>>> destroyed...
>>>>
>>>>
>>>> in summary - I think that you can optimise away 'locate' and a lot of
>>>> 'migrate'-ion - Jetty's current impl has no locate and you can build
>>>> subdivided clusters with it and mod_jk.... but I don't do automatic
>>>> repartitioning yet....
>>>>
>>>>   
>>>
>>>
>>> IIRC Jetty's current impl does it by replicating to all nodes in the
>>> partition and I thought that's what you were trying to reduce :-)
>>>
>> that is ONE way - but you can also, using mod_jk and a hierarchical 
>> lb arrangement (where lb is the worker type, not the Apache process), 
>> split up your cluster into buddy-groups. Affinity is then done to the 
>> buddy-group, rather than the node. Sync replication within the group 
>> gives you a scalable solution. Effectively, each buddy-group becomes
>> its own mini-cluster.
>>
>>>
>>> The basic tradeoff is wide-replication vs. locate-after-death - they 
>>> both
>>> work, I just think locate-after-death results in less overhead 
>>> during normal
>>> operation at the cost of more after a membership change, which seems
>>> preferable.
>>>  
>>>
>> This is correct - I am simply trying to demonstrate that, if we look
>> hard enough, there should be solutions where even after a node death
>> we can avoid session locate/migrate.
>>
>> I would not feel easy performing maintenance on my cluster if I knew
>> that each time I start/stop a node there was likely to be a huge
>> flurry of activity as sessions are located/migrated everywhere - this
>> is bound to impact on QoS, which is to be avoided...
>>
>>
>> Jules
>>
>>>  
>>>
>>>> If you are still reading here, then you are doing well :-)
>>>>   
>>>
>>>
>>> Or you're just one sick puppy :-)
>>>
>>> -- 
>>> Jeremy
>>>
>>>  
>>>
>>
>>
>
>


-- 
/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 **********************************/


