geronimo-dev mailing list archives

From: Jules Gosnell <ju...@coredevelopers.net>
Subject: Re: [clustering] automatic partitioning, state bucketing, etc... (long)
Date: Mon, 18 Aug 2003 09:29:28 GMT
Jeremy Boynes wrote:

>>I figure that we are talking about two different and orthogonal types of
>>partition here.
>>    
>>
>Agreed.
>
>  
>
>>I'm happy to call the way that nodes are linked into buddy-groups
>>(groups of peers that store replicated state for each other) something
>>other than 'partition', if we want to reserve that term for some sort of
>>cluster management concept, but you do agree that these structures
>>exist, regardless of what they are called, do you not? Otherwise you do
>>not scale, as we have all agreed.
>>
>>As for loadbalancer configuration, I think this will draw upon both
>>'jeremy-partition' and 'jules-buddy-group' status, as follows:
>>
>>- you only want to balance requests for a webapp to nodes on which it is
>>deployed
>>    
>>
>Yes
>
>  
>
>>- you only want to fail-over requests to other nodes in the same
>>buddy-group as the failed node
>>    
>>
>Ideally, yes, but this is not essential. See below.
>
>  
>
>>if you can do the latter you can avoid cluster-wide logic for finding and
>>migrating sessions from remote nodes to the one receiving the request,
>>because you can guarantee that the session is already there.
>>    
>>
>
>The price to pay for this is that you always need to replicate state to any
>node to which the request may be directed. If you allow for a locate phase,
>then you can minimise the set of nodes to which data is replicated (the
>buddy-group) because any node can find it later. In a high-affinity
>configuration this reduces the overall load.
>
this is two sides of the same coin :-)

suppose I can instruct mod_jk (which I can) to deliver every request
tied to a particular session to a subset of the cluster's nodes.

would it not make sense for these nodes to be the 'buddy-group'?

then we can forget the locate phase altogether...

This is my current impl.
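
Roughly, something like this in workers.properties terms (untested and
purely illustrative - directive names drift between mod_jk versions, and
the 'domain' grouping is the bit that would map onto the buddy-group):

worker.list=loadbalancer

worker.loadbalancer.type=lb
worker.loadbalancer.sticky_session=1
worker.loadbalancer.balanced_workers=nodeA,nodeB,nodeC,nodeD

# nodeA and nodeB buddy for each other, likewise nodeC and nodeD;
# failover for a given session stays inside its domain (buddy-group)
worker.nodeA.type=ajp13
worker.nodeA.host=hostA
worker.nodeA.port=8009
worker.nodeA.domain=group1

worker.nodeB.type=ajp13
worker.nodeB.host=hostB
worker.nodeB.port=8009
worker.nodeB.domain=group1

worker.nodeC.type=ajp13
worker.nodeC.host=hostC
worker.nodeC.port=8009
worker.nodeC.domain=group2

worker.nodeD.type=ajp13
worker.nodeD.host=hostD
worker.nodeD.port=8009
worker.nodeD.domain=group2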

>
>Consider a four node partition A,B,C,D. In the 'replicate-everywhere' model,
>A's state is replicated to three other nodes after every request, incurring
>the processing cost on three nodes (assuming network multicast). If A dies,
>any node can instantly pick up the work. The issue is we have a lot of
>overhead to reduce the latency in the event of node death (which we hope is
>infrequent).
>
>The other alternative is that every session has one and only one buddy. This
>would result in 1/3 of A's sessions being replicated to B, 1/3 to C and 1/3
>to D. Each session is replicated to just one node, allowing unicast to be
>used (which has a lower overhead than multicast) and only incurring the
>ongoing processing cost on one node.
>
OK - a couple of points here...

1. the decision about how many buddies should be in a group should be
taken at the logical level. If, in the single-buddy case, the transport
can be optimised to reduce latency, then so much the better...

The point about unicast vs multicast and the frequency of node death is 
good...

2. you are touching on what I described as 'bucketing' - how many
buckets should a node's sessions be split into, and where should they be
replicated to? My jury is still out on this...
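
To make the bucketing question concrete, a throwaway sketch (the names
are hypothetical, nothing to do with the current impl): session ids hash
into a fixed number of buckets, and each bucket maps onto one buddy in
the group.

// throwaway sketch - hash each session id into one of N buckets,
// then map each bucket onto a buddy node in the group
public class BucketSketch {

    private final String[] buddies;   // e.g. { "B", "C", "D" }
    private final int numBuckets;

    public BucketSketch(String[] buddies, int numBuckets) {
        this.buddies = buddies;
        this.numBuckets = numBuckets;
    }

    // mask off the sign bit so the bucket is non-negative and stable;
    // every node must agree on this mapping
    public int bucketFor(String sessionId) {
        return (sessionId.hashCode() & 0x7fffffff) % numBuckets;
    }

    // buddy node that holds the replica for this session's bucket
    public String buddyFor(String sessionId) {
        return buddies[bucketFor(sessionId) % buddies.length];
    }

    public static void main(String[] args) {
        BucketSketch s = new BucketSketch(new String[] { "B", "C", "D" }, 12);
        System.out.println(s.buddyFor("0123456789ABCDEF.nodeA"));
    }
}

The interesting knobs are then just the number of buckets and the size
of the buddy list.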

>
>If A dies, then B,C,D pick new buddies for A's sessions and do bulk state
>transfer to redistribute, ensuring that the state is always stored on two
>nodes. Say B transfers to C, C to D and D to B. Again, unicast transfer. You
>can avoid this if you are willing to lose a session if another node dies
>(double failure scenario).
>
if the number of buddies per group were configurable and the
multicast/unicast optimisation automagic, architects could choose whether
or not to pay for the extra robustness.
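
e.g. (purely illustrative) if each survivor simply hands the copies of
the dead node's sessions it holds to the next survivor in a ring, you
get exactly the B->C, C->D, D->B shuffle you describe:

// illustrative only - after A dies, each survivor passes the copies of
// A's sessions it holds to the next survivor in the ring
public class RingRebuddy {

    public static String nextBuddy(String[] survivors, String self) {
        for (int i = 0; i < survivors.length; i++) {
            if (survivors[i].equals(self)) {
                return survivors[(i + 1) % survivors.length];
            }
        }
        throw new IllegalArgumentException("unknown node: " + self);
    }

    public static void main(String[] args) {
        String[] survivors = { "B", "C", "D" };        // A has died
        System.out.println(nextBuddy(survivors, "B")); // C
        System.out.println(nextBuddy(survivors, "D")); // B
    }
}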

>
>An A request is now directed to a random node; if this node has the state,
>then it becomes the primary and starts replicating to its buddy. If it does
>not, then it sends a multicast inquiry to the partition, locates the state,
>does a second transfer and starts replicating again.
>  
>
Now we are getting into territory where the way the LB works
impacts the space we have to work with...

I'm trying to avoid writing my own LB, and instead come up with something
that can work with mod_jk[2]. This constrains me more than it does you.

Drill down a little into the behaviour you would require from an LB,
and let's see where we go...
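
Just so we are talking about the same flow, I read your locate phase as
roughly the following (Store and Locator are made-up interfaces for the
sake of the sketch, not any real API):

// rough sketch of the locate-then-migrate flow described above -
// Store and Locator are invented purely for illustration
public class LocateSketch {

    interface Store {
        Object getLocal(String sessionId);             // null if not held here
        void put(String sessionId, Object state);
        void replicateToBuddy(String sessionId, Object state);
    }

    interface Locator {
        // e.g. multicast inquiry to the partition + unicast state transfer
        Object fetchFromPartition(String sessionId);
    }

    // called on whichever node the LB happened to pick
    static Object activate(String sessionId, Store store, Locator locator) {
        Object state = store.getLocal(sessionId);
        if (state == null) {
            state = locator.fetchFromPartition(sessionId); // the second transfer
            store.put(sessionId, state);
        }
        store.replicateToBuddy(sessionId, state);          // this node is now primary
        return state;
    }
}

How cheap that fetchFromPartition step can be made is exactly where the
LB's behaviour starts to matter.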

>The trade-off is lower overhead whilst running but a larger state transfer
>in the event of node death. I tend to prefer the latter on the basis that
>node deaths are infrequent.
>
agreed

Jules

>
>
>  
>
>>Are we getting closer?
>>
>>    
>>
>:-)
>
>--
>Jeremy
>
>  
>


-- 
/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 **********************************/


