geronimo-dev mailing list archives

From Jules Gosnell <ju...@coredevelopers.net>
Subject Re: [clustering] automatic partitioning, state bucketing, etc... (long)
Date Thu, 21 Aug 2003 12:09:44 GMT
Jeremy Boynes wrote:

>>I'm going to pick up this thread again :-)
>>    
>>
>We just can't leave it alone :-)
>
>  
>
>>we have to deal with both dumb and integrated load-balancers...
>>
>>DUMB LB:
>>
>>(A) undivided cluster, simple approach:
>>
>>every node buddies for every other node
>>no 'locate/migrate' required since every session is on every node
>>replication needs to be synchronous, in order to guarantee that the node
>>on which the next request falls will be up-to-date
>>
>>problem: unscalable
>>
>>(B) subdivided cluster, more complex approach:
>>
>>cluster subdivided into buddy teams (possibly only of pairs).
>>'locate/migrate' required since a request may fall on a node that does
>>not have the session to hand
>>primary could use async and secondary sync replication, provided that
>>'locate' always talked to the primary
>>    
>>
>sync and async are both options - sync may be needed for dumb clients (HTTP
>1.0 or ones which overlap requests, e.g. for frames)
>
>  
>
>>problem: given a cluster of n nodes divided into teams of t nodes, only
>>a fraction t/n of requests will be able to avoid the 'locate/migrate'
>>step - in a large cluster with small teams, this is not much more
>>efficient than a shared-store solution.
>>
>>    
>>
>conclusion: DUMB LB is a bad choice in conjunction with a replication model
>and shared state. Only recommended for use with stateless front ends.
>
let me qualify that - it's OK if your cluster is small enough that it 
effectively has only one buddy-group, i.e. all nodes carry all the 
cluster's state. Then you use sync replication and it works - scaling up 
will require a smarter LB.

>
>>SMART LB (we're assuming it can do pretty much whatever we want it to).
>>    
>>
>We're assuming it is smart in that:
>1) it can affinity sessions (including SSL if required)
>2) it can detect failed nodes
>
I'm assuming that even the dumb one does this (i.e. it's an intrusive 
proxy rather than, e.g., non-intrusive round-robin DNS)

>3) it has (possibly configurable) policies for failing over if a node dies
>
hmm...

>
>I think that's all the capabilities it needs.
>
>  
>
>>(A)
>>
>>assuming affinity, we can use async replication, because the request will
>>always fall on the most up-to-date node.
>>    
>>
>async is a reliability/performance tradeoff - it introduces a window in
>which modified state has not been replicated and may be lost. Again sync vs.
>async should be configurable.
>
exactly - and if you have session affinity you might weigh up the odds of 
a failure compounded by a consecutive request racing ahead of an async 
update, and decide to trade them for the runtime benefit of async over 
sync replication.

>
>  
>
>>if this node fails, the lb MUST pick one to failover to and continue to
>>use that one (or else we have to fall back to sync and assume dumb lb)
>>if original node comes back up, it doesn't matter whether lb goes back
>>to it, or remains stuck to fail-over node.
>>    
>>
>Premise is the LB will pick a new node and affinity to it. How it picks the
>new node is undefined (depends on how the LB works) and may result in a
>locate/migrate step if it picks a node without state.
>
>If the old node comes back it will be out of date and will trigger a
>locate/migrate step if the LB picks it (e.g. if it has a preferential
>affinity model).
>  
>
OK - this has been the root of some confusion. You have been assuming 
that affinity means 'stick-to-last-visited-node' whereas I am using the 
mod_jk reading, which is something like 
'stick-to-a-particular-node-and-if-you-lose-it-you're-screwed'...

>  
>
>>(B)
>>
>>if we can arrange for the LB to use affinity, with failover limited to our
>>buddy-group, and always stick to the failover node as well, we can lose
>>'locate/migrate' and replicate async. If we can't get 'always stick to
>>failover node', we replicate sync after failover.
>>    
>>
>Again async only works if you are willing to lose state.
>
you only lose state, as above, in the event of a failure compounded by a 
session update losing a race with the next request - unless you mean that 
you lose state because it didn't get off-node before the node went down - 
in which case, agreed.

when using JavaGroups, if you multicast your RMI, you have the choice of 
a number of modes, including GET_ALL and GET_NONE. GET_ALL means that you 
wait for a reply from all nodes RMI-ed to (sync); GET_NONE means that 
you don't wait for any (async). I may be wrong, but I don't believe it 
takes any longer for your data to get off-node - you just don't keep the 
client hanging around whilst you wait for all your buddies to confirm 
their new state...
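
For concreteness, here is a minimal sketch of that choice, written 
against the JGroups descendant of the JavaGroups API (package names and 
exact signatures here are from later JGroups releases and may differ 
from the 2003 JavaGroups ones):

import org.jgroups.JChannel;
import org.jgroups.blocks.GroupRequest;
import org.jgroups.blocks.RpcDispatcher;
import org.jgroups.util.RspList;

public class SessionReplicator {

    private final RpcDispatcher dispatcher;

    public SessionReplicator(JChannel channel) {
        // 'this' exposes setState() to remote callers
        dispatcher = new RpcDispatcher(channel, null, null, this);
    }

    // invoked remotely on each buddy
    public void setState(String sessionId, byte[] state) {
        // store the replicated session state locally...
    }

    public void replicate(String sessionId, byte[] state, boolean sync) {
        int mode = sync ? GroupRequest.GET_ALL   // block until all buddies reply
                        : GroupRequest.GET_NONE; // fire-and-forget
        RspList rsps = dispatcher.callRemoteMethods(
            null,                                // null dests = whole group
            "setState",
            new Object[] { sessionId, state },
            new Class[] { String.class, byte[].class },
            mode,
            5000);                               // ms; empty RspList for GET_NONE
    }
}

Either way the message goes out immediately - the mode only governs 
whether the caller blocks for acknowledgements.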

If you take this into account, JG-async is an attractive alternative, 
since losing state now requires a node-failure compounded with a 
transport-failure compounded with a lost 
state-xfer/consecutive-request race...

>
>>if we can only arrange affinity, but not failover within the group, we can
>>replicate async and need 'locate/migrate'. If we can't have
>>lb-remains-stuck-to-failover-node, we are in trouble, because as soon as
>>the primary node fails we go back to the situation outlined above where we
>>do a lot of locate/migrate and are not much better off than a
>>shared store.
>>
>>    
>>
>Don't get you on this one - maybe we have a different definition of
>affinity: mine is that a request will always be directed back to the node
>that served the last one unless that node becomes unavailable. This means
>that a request goes to the last node that served it, not the one that
>originally created the session.
>
agreed - our usage differs, as discussed above.

>
>Even if you have 'affinity to the node that created the session' then you
>don't get a lot of locate/migrate - just a burst when the node comes back
>online.
>  
>
but you do. If affinity is only to the creating node and you lose it, 
you are then back to a stage where only a fraction num-buddies/num-nodes 
of requests will hit a session-bearing node... until you either bring 
the node back up or (this is where the idea of session bucketing is 
useful) call up the lb and remap the session's routing info (usually 
e.g. the node name, but perhaps the bucket name) to another host:port 
combination.

the burst will go on until the replacement node is up or the remapping 
is done and all sessions have been located/migrated back to their 
original node.

>  
>
>>The lb-sticks-to-failover-node is not as simple as it sounds - mod_jk
>>doesn't do it.
>>    
>>
>:-(
>
>  
>
>>it implies
>>
>>either :
>>
>>you have the ability to change the routing info carried on the session
>>id client side (I've considered this and don't think it practical - I
>>may be wrong ...)
>>    
>>
>I'm dubious about this too - it feels wrong but I can't see what it breaks.
>
>I'm assuming that you set JSESSIONID to id.node with node always being the
>last node that serves it. The LB tries to direct the request to node, but if
>it is unavailable picks another from its configuration. If the new node does
>not have state then you do locate/migrate.
>
not quite.

you set the session id ONCE, to e.g. (with mod_jk) <session-id>.<node-id>

resetting it is really problematic.

mod_jk maps (in its workers.properties) this node-id to a host:port 
combination.

if the host:port is unavailable it chooses another node at random - it 
neither tries to reset the cookie/url-param with the client, nor 
remembers the chosen node for subsequent requests on the same session...
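
Concretely, a minimal workers.properties of that shape might look like 
this (names are illustrative):

# two nodes behind one lb worker
worker.list=loadbalancer

worker.node1.type=ajp13
worker.node1.host=hostA
worker.node1.port=8009

worker.node2.type=ajp13
worker.node2.host=hostB
worker.node2.port=8009

worker.loadbalancer.type=lb
worker.loadbalancer.balanced_workers=node1,node2

A session created on node1 gets an id like 'abc123.node1' (assuming the 
node's route name matches its worker name); mod_jk peels off the suffix 
and routes to the node1 worker for as long as it is reachable.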

>
>  
>
>>or :
>>
>>the session id needs to carry not just a single piece of routing info
>>(like a mod_jk worker name) but a failover list - worker1,worker2,worker3
>>etc. - in effect your buddy-team,
>>
>>    
>>
>Again, requires that the client handles changes to JSESSIONID OK. This
>allows the nodes to determine the buddy group and would reduce the chance of
>locate/migrate being needed.
>

AHA! this is where I introduce session buckets.

tying the routing info to node ids is a bad idea - because nodes die. I 
think we should suffix the session ids with a bucket-id. sessions cannot 
move out of a bucket, but a bucket can move to another node. If this 
happens, you call up the lb and remap the bucket-id:host:port. You can 
do this for mod_jk by rewriting its workers.properties and running 
'apachectl graceful' - nasty but solid. mod_jk2 actually has an 
HTTP-based API that could probably do this. Big-IP has a SOAP API which 
will probably let you do this too, etc...
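
With a workers.properties like the one sketched above, a bucket remap 
is then just a config rewrite plus a graceful restart - something like 
(hypothetical bucket/worker names, assuming one worker per bucket):

# before: bucket b1 lives on hostA
worker.b1.type=ajp13
worker.b1.host=hostA
worker.b1.port=8009

# after hostA dies: point b1 at hostB instead, then run
# 'apachectl graceful' to reload the config
worker.b1.type=ajp13
worker.b1.host=hostB
worker.b1.port=8009

Sessions suffixed '.b1' now route to hostB, and no client ever sees its 
session id change.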

The extra level of indirection introduced (session->node becomes 
session->bucket->node) allows you this flexibility, as well as the 
possibility of mapping more than one bucket to each node. Doing this 
will help in redistributing state around a cluster when buddy-groups 
are created/destroyed...
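
As a sketch of the indirection (hypothetical names - nothing like this 
exists yet), the session-manager side could be as simple as:

import java.util.HashMap;
import java.util.Map;

// Sessions hash into a fixed set of buckets; buckets - not
// sessions - are what get mapped to nodes.
public class BucketRouter {

    private final int numBuckets; // fixed for the life of the cluster
    private final Map bucketToNode = new HashMap(); // bucket-id -> "host:port"

    public BucketRouter(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    // a session is pinned to its bucket forever...
    public String bucketFor(String sessionId) {
        int b = (sessionId.hashCode() & 0x7fffffff) % numBuckets;
        return "b" + b;
    }

    // ...so its routing suffix never needs to change client-side
    public String routingId(String sessionId) {
        return sessionId + "." + bucketFor(sessionId);
    }

    // but a bucket can be remapped when its node dies or when
    // buddy-groups are reorganised
    public void remap(String bucketId, String hostPort) {
        bucketToNode.put(bucketId, hostPort);
    }
}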

>
>  
>
>>or:
>>
>>the lb needs to maintain state, remembering where each session was last
>>serviced and always sticking requests for that session to that node. in
>>a large deployment this requires lbs to replicate this state between
>>them so that they can balance over the same nodes in a coordinated
>>fashion. I think F5 Big-IP is capable of this, but effectively you just
>>shift the state problem from yourself to someone else.
>>    
>>
>Not quite - the LB's are sharing session-to-node affinity data which is very
>small; the buddies are sharing session state which is much larger. You are
>sharing the task not shifting it. Yes, the LB's can do this.
>
agreed - but now you have two pieces of infrastructure replicating 
instead of one - I don't see this as ideal. I would rather our solution 
worked for simpler lbs that don't support clever stuff like this...

>
>  
>
>>Note that if your lb can understand extended routing info involving the
>>whole buddy team, then you know that it will always balance requests to
>>members of this team anyway, in which case you can dispense with
>>'locate/migrate' again.
>>    
>>
>It would be useful if the LB did this but I don't think it's a requirement.
>
>I don't think you can dispense with locate unless you are willing to lose
>sessions.
>
>For example, if the buddy group is originally (nodeA, nodeB) and both those
>nodes get cycled out, then the LB will not be able to find a node even if
>the cluster migrates the data to (nodeC, nodeD). When it sees the request
>come in and knows that A and B are unavailable, it will pick a random node,
>say nodeX, and X needs to be able to locate/migrate from C or D.
>
hmmm... I can't store buddy-group and bucket-info in the session id - it 
has to be one or the other - I'll think on this some more and come back 
- maybe tonight...

>
>This also saves the pre-emptive transfer of state from C back to A when A
>rejoins - it only happens if nodeA gets selected.
>
>  
>
>>Finally - you still need a migrate operation as sessions will need to
>>migrate from buddy-group to buddy-group as buddy-groups are created and
>>destroyed...
>>
>>
>>in summary - I think that you can optimise away 'locate' and a lot of
>>'migrate'-ion - Jetty's current impl has no locate and you can build
>>subdivided clusters with it and mod_jk.... but I don't do automatic
>>repartitioning yet....
>>
>>    
>>
>IIRC Jetty's current impl does it by replicating to all nodes in the
>partition and I thought that's what you were trying to reduce :-)
>
that is ONE way - but you can also, using mod_jk and a hierarchical lb 
arrangement (where 'lb' is the worker type, not the Apache process), split 
up your cluster into buddy-groups. Affinity is then done to the 
buddy-group, rather than the node. Sync replication within the group 
gives you a scalable solution. Effectively, each buddy-group becomes 
its own mini-cluster.
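
Sketching what I mean (illustrative only - whether a given mod_jk 
version accepts lb workers as members of another lb worker is something 
I'd have to check):

# two buddy-groups of two nodes each; sessions are suffixed with the
# group name (g1/g2), so affinity is to the group, not the node
worker.list=toplb

worker.nodeA.type=ajp13
worker.nodeA.host=hostA
worker.nodeA.port=8009
worker.nodeB.type=ajp13
worker.nodeB.host=hostB
worker.nodeB.port=8009

worker.g1.type=lb
worker.g1.balanced_workers=nodeA,nodeB

# ...nodeC, nodeD and g2 defined the same way...

worker.toplb.type=lb
worker.toplb.balanced_workers=g1,g2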

>
>The basic tradeoff is wide-replication vs. locate-after-death - they both
>work, I just think locate-after-death results in less overhead during normal
>operation at the cost of more after a membership change, which seems
>preferable.
>  
>
this is correct - I am simply trying to demonstrate that, if we look 
hard enough, there should be solutions where even after node death we 
can avoid session locate/migrate.

I would not feel easy performing maintenance on my cluster if I knew 
that each time I started/stopped a node there was likely to be a huge 
flurry of activity as sessions were located/migrated everywhere - this 
is bound to impact QoS, which is to be avoided...


Jules

>  
>
>>If you are still reading here, then you are doing well :-)
>>    
>>
>Or you're just one sick puppy :-)
>
>--
>Jeremy
>
>  
>


-- 
/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 **********************************/


