geronimo-dev mailing list archives

From lichtner <>
Subject Re: Replication using totem protocol
Date Wed, 18 Jan 2006 18:13:40 GMT

On Wed, 18 Jan 2006, Jules Gosnell wrote:

> I haven't been able to convince myself to take the quorum approach
> because...
> shared-something approach:
> - the shared something is a Single Point of Failure (SPoF) - although
> you could use an HA something.

It's not really a spof. You just fail over to a different resource. All
you need is a lock. You could use two java processes anywhere on the
network, each listening on a socket, with only one holding the lock at a
time. If one is not listening, you try the other one.
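The failover-lock idea above can be sketched roughly as follows. This is purely illustrative; the hostnames in the usage note are hypothetical, and "holding the lock" is modelled simply as keeping a connection open:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of the failover lock: two (or more) lock servers anywhere on the
// network, and an open connection to one of them counts as holding the lock.
public class LockClient {

    /** Try each lock server in turn; return a connected socket to the first
     *  one that is listening, or null if none of them is reachable. */
    public static Socket acquire(InetSocketAddress... servers) {
        for (InetSocketAddress addr : servers) {
            try {
                Socket s = new Socket();
                s.connect(addr, 2000); // 2s connect timeout
                return s;              // we hold the lock while this stays open
            } catch (IOException e) {
                // this server is not listening; fail over to the next one
            }
        }
        return null; // no lock server reachable
    }
}
```

Usage would be something like `acquire(new InetSocketAddress("lock-a.example.com", 9001), new InetSocketAddress("lock-b.example.com", 9001))`, where the two addresses are the redundant lock processes.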

> - If the node holding the lock 'goes crazy', but does not die, the rest
> of the cluster becomes a fragment - so it becomes an SPoF as well.

If by 'goes crazy' you mean that it's up but not making progress, totem
defends against this by detecting processors which fail to make progress
after N token rotations, and declaring them failed when they do.

But if you mean that it sends corrupt data or starts using broken
algorithms etc., then I would need to research it a bit. Defending
against these byzantine failures will definitely be more expensive. I
believe the solution is that you have to process operations on multiple
nodes and compare the results.

I believe this is how Tandem machines work. Each cpu step is voted on.
Byzantine failures can happen because of cosmic rays, or other
physics-related issues.
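The "process on multiple nodes and compare" idea amounts to majority voting over replicated results. A minimal sketch, not taken from any of the projects discussed here:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Voting as a defence against byzantine failures: run the same operation
// on several replicas and accept only a value that a strict majority of
// them reports. Illustrative only.
public class ResultVoter {

    /** Return the result reported by a strict majority of replicas,
     *  or null if no value wins a majority. */
    public static <T> T vote(List<T> results) {
        Map<T, Integer> counts = new HashMap<>();
        for (T r : results) {
            counts.merge(r, 1, Integer::sum); // tally each distinct result
        }
        for (Map.Entry<T, Integer> e : counts.entrySet()) {
            if (e.getValue() > results.size() / 2) {
                return e.getKey(); // a strict majority agrees
            }
        }
        return null; // replicas disagree; flag the operation as suspect
    }
}
```

With three replicas, one byzantine node can be outvoted; with `2f + 1` replicas you tolerate `f` arbitrary failures.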

Definitely much more fun.

> - used in isolation, it does not take into account that the lock may be
> held by the smallest cluster fragment

Yes, it does. The question is, why do you have a partition? If you have a
partition because a network element failed, then put some redundancy in
your network topology. If the partition is virtual, i.e.
congestion-induced, then wait a few seconds for it to heal.

And if you get too many virtual partitions it means you either need to
tweak your failure detection parameters (token-loss-timeout in totem) or
your load is too high and you need to add some capacity to the cluster.

> shared-nothing approach:
> - I prefer this approach, but, as you have stated, if the two halves are
> equally sized...

I didn't mean to say that. In this approach you _must_ set a minimum
quorum which is a strict majority of the size of the rack. If you own
five machines, make the quorum three.
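The quorum rule above is just integer arithmetic. A minimal sketch, with hypothetical names:

```java
// Minimum quorum as a strict majority of the configured cluster size:
// with five machines the quorum is three, so at most one fragment of a
// partition can ever hold it.
public class Quorum {

    public static int minimumQuorum(int clusterSize) {
        return clusterSize / 2 + 1; // strict majority
    }

    /** A fragment may keep operating only if it holds the quorum. */
    public static boolean hasQuorum(int fragmentSize, int clusterSize) {
        return fragmentSize >= minimumQuorum(clusterSize);
    }
}
```

Because two disjoint fragments cannot both contain a strict majority, at most one side of a partition keeps running; the other must pause or shut down.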

> - What if there are two concurrent fractures (does this happen?)

It's no different than any other partition.

> - ActiveCluster notifies you of one membership change at a time - so you
> would have to decide on an algorithm for 'chunking' node loss, so that
> you could decide when a fragmentation had occurred...

The problem exists anyway. Even in totem you can have several
'configurations' installed in quick succession. In order to defend against
this you need to design your state transfer algorithms around it.

> perhaps a hybrid of the two would be able to cover more bases... -
> shared-nothing falling back to shared-something if your fragment is
> sized N/2.

You can definitely make a totally custom algorithm for determining your
majority partition, and it's a fact that hybrid approaches can solve
difficult problems, but for the reasons I gave above I believe that you
can do just fine if you have a redundant network and you keep some CPU
headroom.

> As far as my plans for WADI, I think I am happy to stick with the, 'rely
> on affinity and keep going' approach.
> As far as situations where a distributed object may have more than one
> client, I can see that quorum offers the hope of a solution, but,
> without some very careful thought, I would still be hesitant to stake my
> shirt on it :-) for the reasons given above...
> I hadn't really considered 'pausing' a cluster fragment, so this is a
> useful idea. I guess that I have been thinking more in terms of
> long-lived fractures, rather than short-lived ones. If the latter are
> that much more common, then this is great input and I need to take it
> into account.
> The issue about 'chunking' node loss interests me... I see that the
> EVS4J Listener returns a set of members, so it is possible to express
> the loss of more than one node. How is membership decided and node loss
> aggregated ?

Read the totem protocol article. The membership protocol is in there. But
as I said you can still get a flurry of configurations installed one after
the other. It is only a problem if you plan to do your cluster
re-organization all at once.
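One common way to avoid reorganizing on every configuration in such a flurry is to coalesce views: remember only the latest membership and act once no new view has arrived for a settle period. A sketch, with hypothetical class and method names (this is not the EVS4J API):

```java
import java.util.Set;

// Coalescing rapid membership changes: keep only the most recent view and
// defer cluster re-organization until membership has been quiet for a
// configurable settle period.
public class ViewCoalescer {
    private final long settleMillis;
    private long lastChangeMillis;
    private Set<String> latestView = Set.of();

    public ViewCoalescer(long settleMillis) {
        this.settleMillis = settleMillis;
    }

    /** Called each time a new configuration is installed. */
    public void onView(Set<String> members, long nowMillis) {
        latestView = members;
        lastChangeMillis = nowMillis; // restart the quiet-period clock
    }

    /** True once membership has been stable long enough to reorganize. */
    public boolean readyToReorganize(long nowMillis) {
        return nowMillis - lastChangeMillis >= settleMillis;
    }

    public Set<String> stableView() {
        return latestView;
    }
}
```

The cost of this design is added latency before recovery starts; the benefit is that state transfer runs once against a settled view instead of once per transient configuration.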

