geronimo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jules Gosnell <ju...@coredevelopers.net>
Subject Re: [clustering] automatic partitoning, state bucketing, etc... (long)
Date Wed, 20 Aug 2003 13:08:09 GMT
inline...

Jules Gosnell wrote:

> I'm going to pick up this thread again :-)
>
> we have to deal with both dumb and integrated load-balancers...
>
> DUMB LB:
>
> (A) undivided cluster, simple approach:
>
> every node buddies for every other node
> no 'locate/migrate' required since every session is on every node
> replication needs to be synchronous, in order to guarantee that node 
> on which nex request falls will be up-to-date
>
> problem: unscalable
>
> (B) subdivided cluster, more complex approach:
>
> cluster subdivided into buddy teams (possibly only of pairs).
> 'locate/migrate' required since request may fall on node that does not 
> have session to hand
> primary could use asyn and secondary sync replication, provided that 
> 'locate' always talked to primary
>
> problem: given a cluster of n nodes divided into teams of t nodes: 
> only t/n requests will be able to avoid the 'locate/migrate' step - in 
> a large cluster with small teams, this is not much more efficient than 
> a shared store solution.
>
> SMART LB (we're assuming it can do pretty much whatever we want it to).
>
> (A)
>
> assuming affinity, we can use async replication, because request will 
> always fall on most up to date node.
> if this node fails, the lb MUST pick one to failover to and continue 
> to use that one (or else we have to fall back to sync and assume dumb lb)
> if original node comes back up, it doesn't matter whether lb goes back 
> to it, or remains stuck to fail-over node.
>
> (B)
>
> if we can arrange for LB use affinity, with failover limited to our 
> buddy-group, and always stick to the failover node as well we can lose 
> 'locate/migrate' and replicate asych. If we can't get 'always stick to 
> failover node', we replicate synch after failover.
>
> if we can only arrange affinity, but not failover within group, we can 
> replicate asynch and need 'locate/migrate'. If we can't have 
> lb-remains-stuck-to-failover-node, we are in trouble, because as soon 
> as primary node fails we go back to the situation outlined above where 
> we do a lot of locate/migrate and are not much better off than a 
> shared store.
>
>
> The lb-sticks-to-failover-node is not as simple as it sounds - mod_jk 
> doesn't do it.
>
> it implies
>
> either :
>
> you have the ability to change the routing info carried on the session 
> id client side (I've considered this and don't think it practical - I 
> may be wrong ...)
>
> or :
>
> the session id needs to carry not just a single piece of routing info 
> (like a mod_jk worker name) but a failover list 
> worker1,worker2,worker3 etc in effect your buddy-team,
>
> or:
>
> the lb needs to maintain state, remembering where each session was 
> last serviced and always sticking requests for that session to that 
> node. in a large deployment this requires lbs to replicate this state 
> between them so that they can balance over the same nodes in a 
> coordinated fashion. I think F5 Big-IP is capable of this, but 
> effectively you just shift the state problem from yourself to someone 
> else.

or (and we are moving into state bucketing territory now)

the routing info does not name a node but a bucket. when the cluster 
detects a node down, it remaps the bucket-name:host-port pair held by 
the lb (you can do this with mod_jk and I expect with F5 Big-IP). There 
may be a small period where the lb does not know which node to use - you 
could either ensure against this with mechanisms outlined above or use 
'locate/migrate' as a last resort here...

Jules

>
> Note that if your lb can understand extended routing info involving 
> the whole buddy team, then you know that it will always balance 
> requests to members of this team anyway, in which case you can 
> dispense with 'locate/migrate' again.
>
> Finally - you still need a migrate operation as sessions will need to 
> migrate from buddy-group to buddy-group as buddy-groups are created 
> and destroyed...
>
>
> in summary - I think that you can optimise away 'locate' and a lot of 
> 'migrate'-ion - Jetty's current impl has no locate and you can build 
> subdivided clusters with it and mod_jk.... but I don't do automatic 
> repartitioning yet....
>
>
> If you are still reading here, then you are doing well :-)
>
>
>
> Jules
>
>
> Jeremy Boynes wrote:
>
>>> I figure that we are talking about two different and orthogonal 
>>> types of
>>> partition here.
>>>   
>>
>> Agreed.
>>
>>  
>>
>>> I'm happy to call the way that nodes are linked into buddy-groups
>>> (groups of peers that store replicated state for each other) something
>>> other than 'partition', if we want to reserve that term for some 
>>> sort of
>>> cluster management concept, but you do agree that these structures
>>> exist, do you not ? regardless of what they are called, otherwise 
>>> you do
>>> not scale, as we have all agreed.
>>>
>>> As for loadbalancer configuration I think this will draw upon both
>>> 'jeremy-partition' and 'jules-buddy-group' status as :
>>>
>>> - you only want to balance requests for a webapp to nodes on which 
>>> it is
>>> deployed
>>>   
>>
>> Yes
>>
>>  
>>
>>> - you only want to fail-over requests to other nodes in the same
>>> buddy-group as the failed node
>>>   
>>
>> Ideally, yes but this is not essential. See below.
>>
>>  
>>
>>> if you can do the latter you can avoid cluster-wide logic for findg and
>>> migrating sessions from remote nodes to the one receiving the request,
>>> because you can guarantee that the session is already there.
>>>   
>>
>>
>> The price to pay for this is that you always need to replicate state 
>> to any
>> node to which the request may be directed. If you allow for a locate 
>> phase,
>> then you can minimise the set of nodes to which data is replicated (the
>> buddy-group) because any node can find it later. In a high-affinity
>> configuration this reduces the overall load.
>>
>> Consider a four node partition A,B,C,D. In the 'replicate-everywhere' 
>> model,
>> A's state is replicated to three other nodes after every request, 
>> incurring
>> the processing cost on three nodes (assuming network multicast). If A 
>> dies,
>> any node can instantly pick up the work. The issue is we have a lot of
>> overhead to reduce the latency in the event of node death (which we 
>> hope is
>> infrequent).
>>
>> The other alternative is that every session has one and only one 
>> buddy. This
>> would result in 1/3 of A's sessions being replicated to B, 1/3 to C 
>> and 1/3
>> to D. Each session is replicated to just one node, allowing unicast 
>> to be
>> used (which has a lower overhead than multicast) and only incurring the
>> ongoing processing cost on one node.
>>
>> If A dies, then B,C,D pick new buddies for A's sessions and do bulk 
>> state
>> transfer to redistribute, ensuring that the state is always stored on 
>> two
>> nodes. Say B transfers to C, C to D and D to B. Again, unicast 
>> transfer. You
>> can avoid this if you are willing to lose a session if another node dies
>> (double failure scenario).
>>
>> An A request is now directed to a random node; if this node has the 
>> state,
>> then it becomes the primary and starts replicating to its buddy. If 
>> it does
>> not, then it sends a multicast inquiry to the partition, locates the 
>> state,
>> does a second transfer and starts replicating again.
>>
>> The trade off is lower overhead whilst running but a larger state 
>> transfer
>> in the event of node death. I tend to prefer the latter on the basis 
>> that
>> node deaths are infrequent.
>>
>>
>>  
>>
>>> Are we getting closer ?
>>>
>>>   
>>
>> :-)
>>
>> -- 
>> Jeremy
>>
>>  
>>
>
>


-- 
/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 **********************************/



Mime
View raw message