geronimo-dev mailing list archives

From Andy Piper <an...@bea.com>
Subject Re: Clustering (long)
Date Tue, 02 Aug 2005 12:06:51 GMT
Hi Jules

At 05:37 AM 7/27/2005, Jules Gosnell wrote:

>I agree on the SPoF thing - but I think you misunderstand my 
>Coordinator arch. I do not have a single static Coordinator node, 
>but a dynamic Coordinator role, into which a node may be elected. 
>Thus every node is a potential Coordinator. If the elected 
>Coordinator dies, another is immediately elected. The election 
>strategy is pluggable, although it will probably end up being 
>hardwired to "oldest-cluster-member". The reason behind this is that 
>re-laying out your cluster is much simpler if it is done in a single 
>vm. I originally tried to do it in multiple vms, each taking 
>responsibility for pieces of the cluster, but if the vms' views are 
>not completely in sync, things get very hairy, and completely in 
>sync is an expensive thing to achieve - and would introduce a 
>cluster-wide single point of contention. So I do it in a single vm, 
>as fast as I can, with fail over, in case that vm evaporates. Does 
>that sound better than the scenario that you had in mind ?

This is exactly the "hard" computer science problem that you 
shouldn't be trying to solve if at all possible. It's hard because 
network partitions or hung processes (think GC) make it very easy for 
your colleagues to think you are dead when you do not share that 
view. The result is two processes that both think they are the 
coordinator, and anarchy can ensue (commonly called split-brain syndrome). I can 
point you at papers if you want, but I really suggest that you aim 
for an implementation that is independent of a central coordinator. 
Note that a central coordinator is necessary if you want to implement 
a strongly-consistent in-memory database, but this is not usually a 
requirement for session replication, say.
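
To make the split-brain scenario concrete, here is a minimal sketch of an "oldest-cluster-member" election run independently on each side of a network partition. The class, the Node type and the joinedAt field are hypothetical names invented for this illustration; they are not WADI's API, and the real election strategy may differ.

```java
import java.util.Comparator;
import java.util.List;

// Illustration only: all names here are hypothetical, not WADI's API.
class OldestMemberElection {

    // A cluster member with the logical time at which it joined (smaller = older).
    static final class Node {
        final String name;
        final long joinedAt;

        Node(String name, long joinedAt) {
            this.name = name;
            this.joinedAt = joinedAt;
        }
    }

    // Each node runs this locally against the members it can currently see.
    static Node elect(List<Node> visibleMembers) {
        return visibleMembers.stream()
                .min(Comparator.comparingLong((Node n) -> n.joinedAt))
                .orElseThrow(() -> new IllegalArgumentException("empty view"));
    }

    public static void main(String[] args) {
        Node a = new Node("a", 1);
        Node b = new Node("b", 2);
        Node c = new Node("c", 3);
        Node d = new Node("d", 4);

        // Healthy cluster: every node sees the same view and elects "a".
        System.out.println(elect(List.of(a, b, c, d)).name); // a

        // Partition: {a, b} and {c, d} cannot see each other.
        // Each side elects the oldest member of its own view.
        System.out.println(elect(List.of(a, b)).name); // a
        System.out.println(elect(List.of(c, d)).name); // c
    }
}
```

After the partition, "a" and "c" are both coordinators of what each believes is the whole cluster, which is the split-brain case described above.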

http://research.microsoft.com/Lampson/58-Consensus/Abstract.html 
gives a good introduction to some of these things. I also presented 
at JavaOne on related issues; you should be able to download the 
presentation from dev2dev.bea.com at some point (not there yet - I 
just checked).

>The Coordinator is not there to support session replication, but 
>rather the management of the distributed map (map of which a few 
>buckets live on each node) which is used by WADI to discover very 
>efficiently whether a session exists and where it is located. This 
>map must be rearranged, in the most efficient way possible, each 
>time a node joins or leaves the cluster.

Understood. Once you have a fault-tolerant singleton coordinator you 
can solve lots of interesting problems, it's just hard and often not 
worth the effort or the expense (typical implementations involve HA 
HW or an HA DB or at least 3 server processes).
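
One common reading of the "at least 3 server processes" remark is a majority-quorum rule: a node only acts as coordinator while it can see a strict majority of the configured cluster. The sketch below assumes that reading; the class and method names are hypothetical, and a fixed, known cluster size is an assumption the original does not state.

```java
// Illustration only: a majority-quorum guard, assuming a fixed configured cluster size.
class QuorumCheck {

    // True only if the visible members form a strict majority of the configured cluster.
    static boolean hasQuorum(int visibleMembers, int configuredClusterSize) {
        return visibleMembers > configuredClusterSize / 2;
    }

    // A locally elected node may act as coordinator only while it also holds quorum.
    static boolean mayActAsCoordinator(boolean electedLocally,
                                       int visibleMembers,
                                       int configuredClusterSize) {
        return electedLocally && hasQuorum(visibleMembers, configuredClusterSize);
    }

    public static void main(String[] args) {
        // Three processes split 2/1: only the side with two members keeps a coordinator.
        System.out.println(mayActAsCoordinator(true, 2, 3)); // true
        System.out.println(mayActAsCoordinator(true, 1, 3)); // false

        // Two processes split 1/1: neither side has a majority.
        System.out.println(mayActAsCoordinator(true, 1, 2)); // false
    }
}
```

With only two processes, any partition leaves both sides without a majority, which is why three is the smallest useful size under this rule. The guard does not by itself cover the hung-process case, where a stalled coordinator wakes up still believing its old view.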

>Replication is NYI - but I'm running a few mental background threads 
>that suggest that an extension to the index will mean that it 
>associates the session's id not just to its current location, but 
>also to the location of a number of replicants. I also have ideas on 
>how a session might choose nodes into which it will place its 
>replicants and how I can avoid the primary session copy ever being 
>colocated with a replicant (potential SPoF - if you only have one 
>replicant), etc...

Right, definitely something you want to avoid.
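
As a rough sketch of the constraint in the quoted paragraph (a replicant must never share a node with the primary copy), one simple policy is to walk the membership list starting just after the primary's node. This is only one hypothetical placement policy, not the one Jules describes having in mind; the names and the assumption of a list of distinct node names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical replica placement policy, not WADI's.
class ReplicaPlacement {

    // Picks up to `replicas` nodes for replicants, never the primary's own node.
    // Assumes `members` contains distinct node names, including the primary.
    static List<String> chooseReplicaNodes(String primary, List<String> members, int replicas) {
        int start = members.indexOf(primary);
        if (start < 0) {
            throw new IllegalArgumentException("primary is not a cluster member: " + primary);
        }
        List<String> chosen = new ArrayList<>();
        // Walk the ring of members, starting at the node after the primary.
        for (int i = 1; i < members.size() && chosen.size() < replicas; i++) {
            chosen.add(members.get((start + i) % members.size()));
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> members = List.of("a", "b", "c");
        System.out.println(chooseReplicaNodes("b", members, 2)); // [c, a]
        // A single-node cluster has nowhere safe to put a replicant.
        System.out.println(chooseReplicaNodes("a", List.of("a"), 1)); // []
    }
}
```

Returning fewer nodes than requested, rather than falling back to the primary's node, keeps the no-colocation rule intact when the cluster is too small.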

>Yes, I can see that happening - I have an improvement (NYI) to 
>WADI's evacuation strategy (how sessions are evacuated when a node 
>wishes to leave). Each session will be evacuated to the node which 
>owns the bucket into which its id hashes. This is because colocation 
>of the session with the bucket allows many messages concerned with 
>its future destruction and relocation to be optimised away. Future 
>requests falling elsewhere but needing this session should, in the 
>most efficient case, be relocated to this same node, otherwise the 
>session may be relocated, but at a cost...

How do you relocate the request? Many HW load-balancers do not 
support this (or else it requires using proprietary APIs), so you 
probably have to count on moving sessions in the normal failover case.
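
The evacuation rule in the quoted paragraph (send each session to the node that owns the bucket its id hashes into) can be sketched as follows. The hash function, the round-robin bucket-to-node assignment and all names are assumptions made for illustration; the original does not say how WADI hashes ids or assigns buckets to nodes.

```java
import java.util.List;

// Illustration only: hypothetical bucket lookup, not WADI's implementation.
class BucketIndex {

    // Maps a session id to a bucket in [0, bucketCount). floorMod avoids negative results.
    static int bucketFor(String sessionId, int bucketCount) {
        return Math.floorMod(sessionId.hashCode(), bucketCount);
    }

    // Assumed assignment: buckets are spread round-robin over the member list.
    static String ownerOf(int bucket, List<String> members) {
        return members.get(bucket % members.size());
    }

    // When a node leaves, each of its sessions goes to the owner of the session's bucket,
    // computed over the members that remain.
    static String evacuationTarget(String sessionId, int bucketCount, List<String> remainingMembers) {
        return ownerOf(bucketFor(sessionId, bucketCount), remainingMembers);
    }

    public static void main(String[] args) {
        List<String> remaining = List.of("node-1", "node-2", "node-3");
        System.out.println(evacuationTarget("session-42", 8, remaining));
    }
}
```

Because the target is derived purely from the session id and the membership, any node that later needs the session can compute the same answer without asking around, provided its view of the membership matches.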

>I would be very grateful for any thoughts or feedback that you could 
>give me. I hope to get much more information about WADI into the 
>wiki over the next few weeks. That should help generate more 
>discussion, although I would be more than happy for people to ask me 
>questions here on Geronimo-dev because this will give me an idea of 
>what documentation I should write and how existing documentation may 
>be lacking or misleading.

I guess my general comment would be that you might find it better to 
think specifically about the end-user problem you are trying to solve 
(say session replication) and work towards a solution based on that. 
Most short-cuts / optimizations that vendors make are specific to the 
problem domain and do not generally apply to all clustering problems.

Hope this helps

andy 


