geronimo-dev mailing list archives

From Jules Gosnell <>
Subject Re: Replication using totem protocol
Date Mon, 16 Jan 2006 12:43:00 GMT
lichtner wrote:

>As Jules requested I am looking at the AC api. I report my observations
>ClusterEvent appears to represent membership-related events. These you
>can generate from evs4j, as follows: write an adapter that implements
>evs4j.Listener. In the onConfiguration(..) method you get notified of
>new configurations (new groups). You can generate ClusterEvent.ADD_NODE
>etc. by doing a diff of the old configuration and the new one.
>Evs4j does not support arbitrary algorithms for electing coordinators.
>In fact, in totem there is no coordinator. If a specific election is
>important for you, you can design one using totem's messages. If not,
>in evs4j node names are integers, so the coordinator can be the lowest
>integer. This is checked by evs4j.Configuration.getCoordinator().
>I don't know the difference between REMOVE_NODE and FAILED_NODE. In totem
>there is no difference between the two.
REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies ...

I don't think REMOVE_NODE is actually tied to an event in the external API.

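The configuration-diff approach described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not evs4j code: the class and method names here are hypothetical, though the ideas (diffing old and new memberships, integer node names, lowest id as coordinator per evs4j.Configuration.getCoordinator()) come straight from the discussion above.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: derive ADD_NODE/REMOVE_NODE-style events by
// diffing the old and new membership inside an evs4j.Listener adapter's
// onConfiguration(..) callback. Node names are integers, so the
// coordinator can simply be the lowest id.
public class MembershipDiff {

    // Nodes in the new configuration but not the old: ADD_NODE candidates.
    public static Set<Integer> added(Set<Integer> oldConf, Set<Integer> newConf) {
        Set<Integer> result = new TreeSet<>(newConf);
        result.removeAll(oldConf);
        return result;
    }

    // Nodes in the old configuration but not the new: REMOVE/FAILED candidates.
    public static Set<Integer> removed(Set<Integer> oldConf, Set<Integer> newConf) {
        Set<Integer> result = new TreeSet<>(oldConf);
        result.removeAll(newConf);
        return result;
    }

    // Totem elects no coordinator, so take the lowest integer id,
    // mirroring what evs4j.Configuration.getCoordinator() checks.
    public static int coordinator(Set<Integer> conf) {
        return new TreeSet<>(conf).first();
    }
}
```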
>The only other class I think I need to comment on is Cluster. It
>resembles a jms session, even being coupled to actual jms interfaces. You
>can definitely implement producers and consumers and put them on top of
>evs4j. The method send(Destination, Message) would have to encode Message
>on top of fixed-length evs4j messages. No problem here.
>Personally, I would not have mixed half the jms api with an original api.
>I don't think it sends a good message as far as risk management goes. I
>think people are prepared to deal with a product that says 'we assume jms'
>or 'we are completely home-grown because we are so much better', but not a
>mix of the two. Anyway that's not for me to say. Whatever works.
I'll leave this one to James.
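Encoding a variable-length Message on top of fixed-length evs4j messages, as send(Destination, Message) would have to, amounts to fragmentation and reassembly. A rough sketch, with illustrative names (none of this is evs4j API), relying on totem delivering frames in agreed order:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of carrying an arbitrary-length payload over
// fixed-length frames. Frame size and helper names are illustrative.
public class Framing {

    // Split the payload into chunks of at most frameSize bytes.
    public static List<byte[]> fragment(byte[] payload, int frameSize) {
        List<byte[]> frames = new ArrayList<>();
        for (int off = 0; off < payload.length; off += frameSize) {
            int len = Math.min(frameSize, payload.length - off);
            byte[] frame = new byte[len];
            System.arraycopy(payload, off, frame, 0, len);
            frames.add(frame);
        }
        return frames;
    }

    // Reassemble frames; totem's agreed ordering means no resequencing
    // is needed on the receiving side.
    public static byte[] reassemble(List<byte[]> frames) {
        int total = frames.stream().mapToInt(f -> f.length).sum();
        byte[] payload = new byte[total];
        int off = 0;
        for (byte[] frame : frames) {
            System.arraycopy(frame, 0, payload, off, frame.length);
            off += frame.length;
        }
        return payload;
    }
}
```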

>In conclusion, yes, I think you could build an implementation of AC on top
>of evs4j.
>BTW, how does AC defend against the problem of a split-brain cluster?
>Shared scsi disk? Majority voting? Curious.
Well, I think AC's approach is that it is an app-space problem - but 
James may care to comment.

As an AC user I am considering the following approach for WADI 
(HttpSession and SFSB clustering solution).

(1) simplifying the notifications that I might get from a cluster to the 
following :
- Split : 0-N nodes have left the cluster
- Merge : 0-N nodes have joined the cluster
- Change : A node has updated its public, distributed state
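The simplified notification model above could be sketched as a small listener interface, with raw membership transitions collapsed into at most one Split and one Merge. All names here are hypothetical, not WADI or AC API:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: every membership change collapses to a Split or
// a Merge, plus a Change notification for republished node state.
public class ClusterNotifications {

    public interface Listener {
        void onSplit(Set<Integer> departed); // 0-N nodes left (cleanly, died, or partitioned away)
        void onMerge(Set<Integer> arrived);  // 0-N nodes joined (fresh, or healing a partition)
        void onChange(int node);             // a node republished its public, distributed state
    }

    // Collapse one raw membership transition into coarse notifications.
    public static void notify(Set<Integer> oldConf, Set<Integer> newConf, Listener l) {
        Set<Integer> departed = new TreeSet<>(oldConf);
        departed.removeAll(newConf);
        Set<Integer> arrived = new TreeSet<>(newConf);
        arrived.removeAll(oldConf);
        if (!departed.isEmpty()) l.onSplit(departed);
        if (!arrived.isEmpty()) l.onMerge(arrived);
    }
}
```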

'Split' should now be generic enough to encompass the following common cases:

- a node leaving cleanly (having evacuated its state, therefore carrying 
NO state) [clean node shutdown]
- a node dying (therefore carrying state) [catastrophic failure]
- a group of nodes falling out of contact (still carrying state) 
[network partition]

'Merge' can encompass:

- a new node joining (therefore carrying NO state).
- a group of nodes coming back into contact after a split (carrying 
state that needs to be merged) [network healing]


'Change' is the same as in AC: each node makes public, by distribution, a small 
amount of data, which is republished each time it is updated.

By also treating nodes joining, leaving and dying as split and merge 
operations I can reduce the number of cases that I have to deal with, 
and ensure that what might be very uncommonly run code (run on network 
partition/healing) is the same code that is commonly run on e.g. node 
join/leave - so it is likely to be more robust.

In the case of a binary split, I envisage two sets of nodes losing 
contact with each other. Each cluster fragment will repair its internal 
structure. I expect that after this repair, neither fragment will carry 
a complete copy of the cluster's original state (unless we are 
replicating 1->all, which WADI will usually not do), rather, the two 
datasets will intersect and their union will be the original dataset. 
Replicated state will carry a version number.

If client affinity survives the split (i.e. clients continue to talk to 
the same nodes), then we should find ourselves in a working state, with 
two smaller clusters carrying overlapping and diverging state. Each 
piece of state should be static in one subcluster and divergent in the 
other (it has only one client). The version carried by each piece of 
state may be used to decide which is the most recent version.
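Version-based reconciliation after a merge could look something like the following sketch. The types and names are hypothetical (not WADI code); the assumption, as above, is that each piece of replicated state carries a version number and that the two fragments' datasets intersect:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: after a binary split, each fragment holds an
// overlapping subset of the sessions. Where a session changed in only
// one fragment, the higher version number picks the winner trivially.
public class VersionedMerge {

    public record Versioned(long version, String data) {}

    // Union the two fragments' maps, keeping the higher-versioned copy
    // wherever both fragments hold the same key.
    public static Map<String, Versioned> merge(Map<String, Versioned> a,
                                               Map<String, Versioned> b) {
        Map<String, Versioned> result = new HashMap<>(a);
        b.forEach((key, theirs) -> result.merge(key, theirs,
                (ours, other) -> ours.version() >= other.version() ? ours : other));
        return result;
    }
}
```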

(If client affinity is not maintained, then, without a backchannel of 
some sort, we are in trouble).

When a merge occurs, WADI will be able to merge the internal 
representations of the participants, delegating awkward decisions about 
divergent state to deploy-time pluggable algorithms. Hopefully, each 
piece of state will only have diverged in one cluster fragment, so 
choosing which copy to go forward with will be trivial.
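A deploy-time pluggable algorithm for the awkward case, where a piece of state diverged in both fragments and versions alone cannot decide, might be as simple as a strategy interface. Again, purely an illustrative sketch, not WADI API:

```java
// Hypothetical sketch of a pluggable policy for resolving state that
// diverged in *both* fragments. Interface and policies are illustrative.
public class DivergencePolicy {

    public interface Resolver {
        String resolve(String copyA, String copyB);
    }

    // Trivial default: prefer fragment A (e.g. the larger fragment).
    public static final Resolver PREFER_A = (a, b) -> a;

    // Only delegate to the plugged-in resolver when the copies differ.
    public static String reconcile(String a, String b, Resolver r) {
        return a.equals(b) ? a : r.resolve(a, b);
    }
}
```

Other plug-ins might merge the copies, keep both under distinct keys, or discard both; the point is only that the decision is deferred to deploy time.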

A node death can just be thought of as a 'split' which never 'merges'.

Of course, multiple splits could occur concurrently and merging them is 
a little more complicated than I may have implied, but I am getting 
ahead of myself.
WADI's approach should work for HttpSession and SFSB, where there is a 
single client who will be talking to a single node. In the case of some 
other type, where clients for the same resource may end up in different 
cluster fragments, this approach will be insufficient.

I would be very interested in hearing your thoughts on the subject,



"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * Open Source Training & Support.
