geronimo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jules Gosnell <>
Subject Re: [clustering] automatic partitoning, state bucketing, etc... (long)
Date Wed, 20 Aug 2003 12:09:14 GMT
I'm going to pick up this thread again :-)

we have to deal with both dumb and integrated load-balancers...


(A) undivided cluster, simple approach:

every node buddies for every other node
no 'locate/migrate' required since every session is on every node
replication needs to be synchronous, in order to guarantee that node on 
which nex request falls will be up-to-date

problem: unscalable

(B) subdivided cluster, more complex approach:

cluster subdivided into buddy teams (possibly only of pairs).
'locate/migrate' required since request may fall on node that does not 
have session to hand
primary could use asyn and secondary sync replication, provided that 
'locate' always talked to primary

problem: given a cluster of n nodes divided into teams of t nodes: only 
t/n requests will be able to avoid the 'locate/migrate' step - in a 
large cluster with small teams, this is not much more efficient than a 
shared store solution.

SMART LB (we're assuming it can do pretty much whatever we want it to).


assuming affinity, we can use async replication, because request will 
always fall on most up to date node.
if this node fails, the lb MUST pick one to failover to and continue to 
use that one (or else we have to fall back to sync and assume dumb lb)
if original node comes back up, it doesn't matter whether lb goes back 
to it, or remains stuck to fail-over node.


if we can arrange for LB use affinity, with failover limited to our 
buddy-group, and always stick to the failover node as well we can lose 
'locate/migrate' and replicate asych. If we can't get 'always stick to 
failover node', we replicate synch after failover.

if we can only arrange affinity, but not failover within group, we can 
replicate asynch and need 'locate/migrate'. If we can't have 
lb-remains-stuck-to-failover-node, we are in trouble, because as soon as 
primary node fails we go back to the situation outlined above where we 
do a lot of locate/migrate and are not much better off than a shared store.

The lb-sticks-to-failover-node is not as simple as it sounds - mod_jk 
doesn't do it.

it implies

either :

you have the ability to change the routing info carried on the session 
id client side (I've considered this and don't think it practical - I 
may be wrong ...)

or :

the session id needs to carry not just a single piece of routing info 
(like a mod_jk worker name) but a failover list worker1,worker2,worker3 
etc in effect your buddy-team,


the lb needs to maintain state, remembering where each session was last 
serviced and always sticking requests for that session to that node. in 
a large deployment this requires lbs to replicate this state between 
them so that they can balance over the same nodes in a coordinated 
fashion. I think F5 Big-IP is capable of this, but effectively you just 
shift the state problem from yourself to someone else.

Note that if your lb can understand extended routing info involving the 
whole buddy team, then you know that it will always balance requests to 
members of this team anyway, in which case you can dispense with 
'locate/migrate' again.

Finally - you still need a migrate operation as sessions will need to 
migrate from buddy-group to buddy-group as buddy-groups are created and 

in summary - I think that you can optimise away 'locate' and a lot of 
'migrate'-ion - Jetty's current impl has no locate and you can build 
subdivided clusters with it and mod_jk.... but I don't do automatic 
repartitioning yet....

If you are still reading here, then you are doing well :-)


Jeremy Boynes wrote:

>>I figure that we are talking about two different and orthogonal types of
>>partition here.
>>I'm happy to call the way that nodes are linked into buddy-groups
>>(groups of peers that store replicated state for each other) something
>>other than 'partition', if we want to reserve that term for some sort of
>>cluster management concept, but you do agree that these structures
>>exist, do you not ? regardless of what they are called, otherwise you do
>>not scale, as we have all agreed.
>>As for loadbalancer configuration I think this will draw upon both
>>'jeremy-partition' and 'jules-buddy-group' status as :
>>- you only want to balance requests for a webapp to nodes on which it is
>>- you only want to fail-over requests to other nodes in the same
>>buddy-group as the failed node
>Ideally, yes but this is not essential. See below.
>>if you can do the latter you can avoid cluster-wide logic for findg and
>>migrating sessions from remote nodes to the one receiving the request,
>>because you can guarantee that the session is already there.
>The price to pay for this is that you always need to replicate state to any
>node to which the request may be directed. If you allow for a locate phase,
>then you can minimise the set of nodes to which data is replicated (the
>buddy-group) because any node can find it later. In a high-affinity
>configuration this reduces the overall load.
>Consider a four node partition A,B,C,D. In the 'replicate-everywhere' model,
>A's state is replicated to three other nodes after every request, incurring
>the processing cost on three nodes (assuming network multicast). If A dies,
>any node can instantly pick up the work. The issue is we have a lot of
>overhead to reduce the latency in the event of node death (which we hope is
>The other alternative is that every session has one and only one buddy. This
>would result in 1/3 of A's sessions being replicated to B, 1/3 to C and 1/3
>to D. Each session is replicated to just one node, allowing unicast to be
>used (which has a lower overhead than multicast) and only incurring the
>ongoing processing cost on one node.
>If A dies, then B,C,D pick new buddies for A's sessions and do bulk state
>transfer to redistribute, ensuring that the state is always stored on two
>nodes. Say B transfers to C, C to D and D to B. Again, unicast transfer. You
>can avoid this if you are willing to lose a session if another node dies
>(double failure scenario).
>An A request is now directed to a random node; if this node has the state,
>then it becomes the primary and starts replicating to its buddy. If it does
>not, then it sends a multicast inquiry to the partition, locates the state,
>does a second transfer and starts replicating again.
>The trade off is lower overhead whilst running but a larger state transfer
>in the event of node death. I tend to prefer the latter on the basis that
>node deaths are infrequent.
>>Are we getting closer ?

 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)

View raw message