geronimo-dev mailing list archives

From Jules Gosnell <ju...@coredevelopers.net>
Subject Re: Replication using totem protocol
Date Mon, 16 Jan 2006 23:07:48 GMT
lichtner wrote:

>On Mon, 16 Jan 2006, Jules Gosnell wrote:
>
>>REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies ...
>
>I figured. I imagine that if I had to add this distinction to totem I
>would add a message where the node in question announces that it is
>leaving, and then stops forwarding the token. On the other hand, it does
>not need to announce anything, and the other nodes will detect that it
>left. In fact totem does not judge a node either way: you can leave
>because you want to or under duress, and the consequences as far as
>distributed algorithms are concerned are probably minimal. I think the
>only place where this might matter is for logging purposes (but that
>could be handled at the application level) or to speed up the membership
>protocol, although it's already pretty fast.
>
>So I would not draw a distinction there.
>
>>By also treating nodes joining, leaving and dying as split and merge
>>operations I can reduce the number of cases that I have to deal with.
>
>I would even add that the difference is known only to the application.
>
>>and ensure that what might be very uncommonly run code (run on network
>>partition/healing) is the same code that is commonly run on e.g. node
>>join/leave - so it is likely to be more robust.
>
>Sounds good.
>
>>In the case of a binary split, I envisage two sets of nodes losing
>>contact with each other. Each cluster fragment will repair its internal
>>structure. I expect that after this repair, neither fragment will carry
>>a complete copy of the cluster's original state (unless we are
>>replicating 1->all, which WADI will usually not do); rather, the two
>>datasets will intersect and their union will be the original dataset.
>>Replicated state will carry a version number.
>
>I think a version number should work very well.
>
>>If client affinity survives the split (i.e. clients continue to talk to
>>the same nodes), then we should find ourselves in a working state, with
>>two smaller clusters carrying overlapping and diverging state. Each
>>piece of state should be static in one subcluster and divergent in the
>>other (it has only one client). The version carried by each piece of
>>state may be used to decide which is the most recent version.
>>
>>(If client affinity is not maintained, then, without a backchannel of
>>some sort, we are in trouble).
>>
>>When a merge occurs, WADI will be able to merge the internal
>>representations of the participants, delegating awkward decisions about
>>divergent state to deploy-time pluggable algorithms. Hopefully, each
>>piece of state will only have diverged in one cluster fragment, so
>>choosing which copy to go forward with will be trivial.
>
>>A node death can just be thought of as a 'split' which never 'merges'.
>
>Definitely :)
>
>>Of course, multiple splits could occur concurrently and merging them is
>>a little more complicated than I may have implied, but I am getting
>>there....
>
>Although I consider the problem of session replication less than
>glamorous, since it is at hand, I would approach it this way:
>
>1. The user should configure a minimum-degree-of-replication R. This is
>the number of replicas of a specific session which need to be available in
>order for an HTTP request to be serviced.
>
WADI's current architecture is similar. I would specify e.g. a
max-num-of-backups. Provided that the cluster has sufficient members,
each session would maintain this number of backup copies.

The fact that one copy of a session is the primary and the others are
backups is an important distinction that WADI makes. Making/refreshing
session backups involves serialisation. Serialisation of an HttpSession
may involve the notification of passivation to components within the
session. It is important that only one copy of the session thinks that
it is the 'active' copy at any one time.
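
To make the shape of that concrete, something like the following is
what I have in mind (a rough sketch only - SessionCopy and its methods
are invented for this mail, not WADI's real classes):

public class SessionCopy implements java.io.Serializable {

    private final String id;
    private boolean primary;   // exactly one copy should believe it is primary
    private long version;      // bumped by the primary on every update
    private byte[] payload;    // the serialised HttpSession contents

    public SessionCopy(String id, boolean primary) {
        this.id = id;
        this.primary = primary;
    }

    // Only the primary may publish a new payload to its backups; the
    // passivation callbacks fire during the serialisation that produces it.
    public void publish(byte[] freshlySerialisedState) {
        if (!primary) throw new IllegalStateException("not the primary copy");
        version++;
        payload = freshlySerialisedState;
    }

    // On merge, the copy carrying the higher version is the one to keep.
    public static SessionCopy newer(SessionCopy a, SessionCopy b) {
        return (a.version >= b.version) ? a : b;
    }
}

The version field is also what the merge code discussed above would
compare when two fragments rejoin and each holds a copy of the same
session.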

>2. When an HTTP request arrives, if the cluster which received it does
>not have R copies then it blocks (it waits until there are). This should
>work in data centers because partitions are likely to be very short-lived
>(aka virtual partitions, which are due to congestion, not to any hardware
>issue).
>
Interesting. I was intending to actively repopulate the cluster
fragment as soon as the split was detected. I figure that
- the longer that sessions spend without their full complement of
backups, the more likely it is that a further failure will result in data loss.
- the split is an exceptional circumstance at which you would expect to
pay an exceptional cost (regenerating missing primaries from backups and
vice versa).

By waiting for a request to arrive for a session before ensuring it has
its correct complement of backups, you extend the time during which it
is 'at risk'. By doing this 'lazily', you will also have to perform an
additional check on every request arrival, which you would not have to
do if you had regenerated missing state at the point that you noticed
the split.

Having said this, if splits are generally short-lived then I can see that
this approach would save a lot of cycles.

Maybe there is an intermediate approach? After a split is detected, you
react lazily for a while; then, if you decide that the problem is not
short-lived, you regenerate the remaining missing structure in your
cluster fragment.
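
Sketching that intermediate approach (again, invented names -
ClusterFragment and SplitRepairPolicy are not real WADI types), it
might look something like:

import java.util.Timer;
import java.util.TimerTask;

// 'ClusterFragment' is an invented stand-in for whatever object knows
// which sessions lost replicas in the split.
interface ClusterFragment {
    void regenerateMissingBackups();
}

public class SplitRepairPolicy {

    private final long gracePeriodMillis;
    private final Timer timer = new Timer(true);

    public SplitRepairPolicy(long gracePeriodMillis) {
        this.gracePeriodMillis = gracePeriodMillis;
    }

    // Called when a membership change tells us we have lost contact with
    // peers. During the grace period we stay lazy (repairing a session only
    // when a request for it actually arrives); the scheduled task should be
    // cancelled if the partition heals in the meantime.
    public void onSplitDetected(final ClusterFragment fragment) {
        timer.schedule(new TimerTask() {
            public void run() {
                // The split has outlived the grace period - treat it as real
                // and pay the exceptional cost up front.
                fragment.regenerateMissingBackups();
            }
        }, gracePeriodMillis);
    }
}

The grace period would presumably be tuned to roughly the lifetime you
expect of one of your 'virtual' partitions.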

>3. If at any time an HTTP request reaches a server which does not itself
>have a replica of the session, it sends a client redirect to a node which
>does.
>
WADI can relocate the request to the session, as you suggest (via
redirect or proxy), or the session to the request, by migration.
Relocating the request should scale better since requests are generally
smaller and, in the web tier, may run concurrently through the same
session, whereas sessions are generally larger and may only be migrated
serially (since only one copy at a time may be 'active').
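
A rough sketch of the redirect flavour of request relocation, as a
servlet filter (ReplicaRegistry is invented for illustration; a proxying
variant would stream the response back instead of redirecting):

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

public class RelocationFilter implements Filter {

    // Invented lookup interface: who holds this session?
    interface ReplicaRegistry {
        boolean hasLocalReplica(String sessionId);
        String locatePrimary(String sessionId); // e.g. "node3:8080"
    }

    private final ReplicaRegistry registry;

    public RelocationFilter(ReplicaRegistry registry) {
        this.registry = registry;
    }

    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        HttpServletResponse res = (HttpServletResponse) response;
        String id = req.getRequestedSessionId();
        if (id == null || registry.hasLocalReplica(id)) {
            chain.doFilter(req, res);   // no session, or it is local - serve here
        } else {
            // Relocate the request: send the client to a node holding the session.
            res.sendRedirect("http://" + registry.locatePrimary(id) + req.getRequestURI());
        }
    }
}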

>4. When a new cluster is formed (with nodes coming or going), it takes an
>inventory of all the sessions and their version numbers. Sessions which do
>not have the necessary degree of replication need to be fixed, which will
>require some state transfer,
>
Yes - agreed.

> and possibly migration of some sessions for
>proper load balancing.
>
Forcing the balancing of state around the cluster is something that I
have considered with WADI, but not yet tried to implement. The type of
load-balancer that is being used has a big impact here. If you cannot
communicate a change of session location satisfactorily to the HTTP load
balancer, then you have to just go with wherever it decides a session is
located.... With SFSBs we should have much more control at the client
side, so this becomes a real option.
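
For completeness, the inventory step you describe in (4) might look
roughly like this (invented names again; it only shows the replica-count
side - version comparison would happen during the state transfer itself):

import java.util.*;

public class InventoryTask {

    private final int requiredReplicas; // the user-configured R

    public InventoryTask(int requiredReplicas) {
        this.requiredReplicas = requiredReplicas;
    }

    // replicaCounts: sessionId -> number of live copies reported by the new view.
    // Returns the session ids that need state transfer to a new backup.
    public List underReplicated(Map replicaCounts) {
        List toFix = new ArrayList();
        for (Iterator i = replicaCounts.entrySet().iterator(); i.hasNext();) {
            Map.Entry e = (Map.Entry) i.next();
            int copies = ((Integer) e.getValue()).intValue();
            if (copies < requiredReplicas) {
                toFix.add(e.getKey());
            }
        }
        return toFix;
    }
}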

All in all, though, it sounds like we see pretty much eye to eye :-)

The lazy partition regeneration is an interesting idea, and this is the
second time it has been suggested to me, so I will give it some serious
thought.

Thanks for taking the time to share your thoughts,


Jules

>Guglielmo


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/

