geronimo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joe Bohn <>
Subject Re: Clustering (long)
Date Wed, 03 Aug 2005 12:47:12 GMT
You can define an order to the semaphores when locking and thereby avoid 
a deadlock.  If each node being added or terminating itself honors the 
order then you will never have a deadlock.   However, you still need to 
deal with the case of an uncontrolled failure either adding or removing 
a note and possibly never releasing a lock.


Jules Gosnell wrote:

> hmm... hmmm... :-)
> more thoughts on (1) and (2)...
> When a node leaves/joins it needs to acquire a lease on the bucket 
> tables of every node that it intends to move buckets from/to. If two 
> nodes are doing this at the same time, their requirement will collide 
> (deadlock) somewhere in the cluster. At this point they may be 
> notified and e.g. compare ip addresses to decide who continues and who 
> backs off for a while.
> So, (1) and (2), whilst being possible are probably more complex than 
> I initially imagined. If we have Paxos for the more general purpose 
> case (3) anyway, it would probably be smart just to go with this, 
> until such optimisations becomes necessary, if at all.
> Jules
> Jules Gosnell wrote:
>> hmmm...
>> now I'm wondering about my solutions to (1) and (2) - if more than 
>> one node tries to join or leave at the same time I may be in trouble 
>> - so it may be safer to go straight to (3) for all cases...
>> more thought needed :-)
>> Jules
>> Jules Gosnell wrote:
>>> I've had a look at the Lampson paper, but didn't take it all in on 
>>> the first pass - I think it will need some serious concentration. 
>>> The Paxos algorithm looks interesting, I will definitely pursue this 
>>> avenue.
>>> I've also given a little thought to exactly why I need a Coordinator 
>>> and how Paxos might be used to replace it. My use of a Coordinator 
>>> and plans for its future do not actually seem that far from Paxos, 
>>> on a preliminary reading.
>>> Given that WADI currently uses a distributed map of 
>>> sessionId:sessionLocation, that this distribution is achieved by 
>>> sharing out responsibility for the set number of buckets that 
>>> comprise the map roughly evenly between the cluster members and that 
>>> this is currently my most satisfying design, I can break my problem 
>>> space (for bucket arrangement) down into 3 basic cases :
>>> 1) Node joins
>>> 2) Node leaves in controlled fashion
>>> 3) Node dies
>>> If the node under discussion is the only cluster member, then no 
>>> bucket rearrangement is necessary - this node will either create or 
>>> destroy the full set of buckets. I'll leave this set of subcases as 
>>> trivial.
>>> 1)  The joining node will need to assume responsibility for a number 
>>> of buckets. If buckets-per-node is to be kept roughly the same for 
>>> every node, it is likely that the joining node will require transfer 
>>> of a small number of buckets from every current cluster member i.e. 
>>> we are starting a bucket rearrangement that will involve every 
>>> cluster member and only need be done if the join is successful. So, 
>>> although we wish to avoid an SPoF, if that SPoF turns out to be the 
>>> joining node, then I don't see it as a problem, If the node joining 
>>> dies, then we no longer have to worry about rearranging our buckets 
>>> (unless we have lost some that had already been transferred - see 
>>> (3)). Thus the joining node may be used as a single 
>>> Coordinator/Leader for this negotiation without fear of the SPoF 
>>> problem. Are we on the same page here ?
>>> 2) The same argument may be applied in reverse to a node leaving in 
>>> a controlled fashion. It will wish to evacuate its buckets roughly 
>>> equally to all remaining cluster members. If it shuts down cleanly, 
>>> this would form part of its shutdown protocol. If it dies before or 
>>> during the execution of this protocol then we are back at (3), if 
>>> not, then the SPoF issue may again be put to one side.
>>> 3) This is where things get tricky :-) Currently WADI has, for the 
>>> sake of simplicity, one single algorithm / thread / point-of-failure 
>>> which recalculates a complete bucket arrangement if it detects (1), 
>>> (2) or (3). It would be simple enough to offload the work done for 
>>> (1) and (2) to the node joining/leaving and this should reduce 
>>> wadi's current vulnerability, but we still need to deal with 
>>> catastrophic failure. Currently WADI rebuilds the missing buckets by 
>>> querying the cluster for the locations of any sessions that fall 
>>> within them, but it could equally carry a replicated backup and dust 
>>> it off as part of this procedure. It's just a trade-off between work 
>>> done up front and work done in exceptional circumstance... This is 
>>> the place where the Paxos algorithm may come in handy - bucet 
>>> recomposition and rearrangement. I need to give this further 
>>> thought. For the immediate future, however, I think WADI will stay 
>>> with a single Coordinator in this situation, which fails-over if 
>>> says it should - I'm delegating 
>>> the really thorny problem to James :-). I agree with you that this 
>>> is an SPoF and that WADI's ability to recover from failure here 
>>> depends directly on how we decide if a node is alive or dead - a 
>>> very tricky thing to do.
>>> In conclusion then, I think that we have usefully identified a 
>>> weakness that will become more relevant as the rest of WADI's 
>>> features mature. The Lampson paper mentioned describes an algorithm 
>>> for allowing nodes to reach a consensus on actions to be performed, 
>>> in a redundant manner with no SPoF and I shall consider how this 
>>> might replace WADI's currently single Coordintor, whilst also 
>>> looking at performing other Coordination on joining/leaving nodes 
>>> where its failure, coinciding with that of its host node, will be 
>>> irrelevant, since the very condition that it was intended to resolve 
>>> has ceased to exist.
>>> How does that sound, Andy ? Do you agree with my thoughts on (1) & 
>>> (2) ? This is great input - thanks,
>>> Jules
>>> Jules Gosnell wrote:
>>>> Andy Piper wrote:
>>>>> Hi Jules
>>>>> At 05:37 AM 7/27/2005, Jules Gosnell wrote:
>>>>>> I agree on the SPoF thing - but I think you misunderstand my 
>>>>>> Coordinator arch. I do not have a single static Coordinator node,

>>>>>> but a dynamic Coordinator role, into which a node may be elected.

>>>>>> Thus every node is a potential Coordinator. If the elected 
>>>>>> Coordinator dies, another is immediately elected. The election 
>>>>>> strategy is pluggable, although it will probably end up being 
>>>>>> hardwired to "oldest-cluster-member". The reason behind this is 
>>>>>> that relaying out your cluster is much simpler if it is done in a

>>>>>> single vm. I originally tried to do it in multiple vms, each 
>>>>>> taking responsibility for pieces of the cluster, but if the vms 
>>>>>> views are not completely in sync, things get very hairy, and 
>>>>>> completely in sync is an expensive thing to achieve - and would 
>>>>>> introduce a cluster-wide single point of contention. So I do it 
>>>>>> in a single vm, as fast as I can, with fail over, in case that vm

>>>>>> evaporates. Does that sound better than the scenario that you had

>>>>>> in mind ?
>>>>> This is exactly the "hard" computer science problem that you 
>>>>> shouldn't be trying to solve if at all possible. Its hard because 
>>>>> network partitions or hung processes (think GC) make it very easy 
>>>>> for your colleagues to think you are dead when you do not share 
>>>>> that view. The result is two processes who think they are the 
>>>>> coordinator and anarchy can ensue (commonly called split-brain 
>>>>> syndrome). I can point you at papers if you want, but I really 
>>>>> suggest that you aim for an implementation that is independent of 
>>>>> a central coordinator. Note that a central coordinator is 
>>>>> necessary if you want to implement a strongly-consistent in-memory 
>>>>> database, but this is not usually a requirement for session 
>>>>> replication say.
>>>>> gives a good introduction to some of these things. I also 
>>>>> presented at JavaOne on related issues, you should be able to 
>>>>> download the presentation from at some point (not 
>>>>> there yet - I just checked).
>>>> OK - I will have a look at these papers and reconsider... perhaps I 
>>>> can come up with some sort of fractal algorithm which recursively 
>>>> breaks down the cluster into subclusters each of which is capable 
>>>> of doing likewise to itself and then  layout the buckets 
>>>> recursively via this metaphor... - this would be much more robust, 
>>>> as you point out, but, I think, a more complicated architecture. I 
>>>> will give it some serious thought. Have you any suggestions/papers 
>>>> as to how you might do something like this in a distributed manner, 
>>>> bearing in mind that as a node joins, some existing nodes will see 
>>>> it as having joined and some will not yet have noticed and 
>>>> vice-versa on leaving....
>>>>>> The Coordinator is not there to support session replication, but

>>>>>> rather the management of the distributed map (map of which a few

>>>>>> buckets live on each node) which is used by WADI to discover very

>>>>>> efficiently whether a session exists and where it is located. 
>>>>>> This map must be rearranged, in the most efficient way possible,

>>>>>> each time a node joins or leaves the cluster.
>>>>> Understood. Once you have a fault-tolerant singleton coordinator 
>>>>> you can solve lots of interesting problems, its just hard and 
>>>>> often not worth the effort or the expense (typical implementations 
>>>>> involve HA HW or an HA DB or at least 3 server processes).
>>>> Since I am only currently using the singleton coordinator for 
>>>> bucket arrangement, I may just live with it for the moment, in 
>>>> order to move forward, but make a note to replace it and start 
>>>> background threads on how that might be achieved...
>>>>>> Replication is NYI - but I'm running a few mental background 
>>>>>> threads that suggest that an extension to the index will mean 
>>>>>> that it associates the session's id not just to its current 
>>>>>> location, but also to the location of a number of replicants. I 
>>>>>> also have ideas on how a session might choose nodes into which it

>>>>>> will place its replicants and how I can avoid the primary session

>>>>>> copy ever being colocated with a replicant (potential SPoF - if 
>>>>>> you only have one replicant), etc...
>>>>> Right definitely something you want to avoid.
>>>>>> Yes, I can see that happening - I have an improvement (NYI) to 
>>>>>> WADI's evacuation strategy (how sessions are evacuated when a 
>>>>>> node wishes to leave). Each session will be evacuated to the node

>>>>>> which owns the bucket into which its id hashes. This is because 
>>>>>> colocation of the session with the bucket allows many messages 
>>>>>> concered with its future destruction and relocation to be 
>>>>>> optimised away. Future requests falling elsewhere but needing 
>>>>>> this session should, in the most efficient case, be relocated to

>>>>>> this same node, other wise the session may be relocated, but at a

>>>>>> cost...
>>>>> How do you relocate the request? Many HW load-balancers do not 
>>>>> support this (or else it requires using proprietary APIs), so you 
>>>>> probably have to count on
>>>>> moving sessions in the normal failover case.
>>>> If I can squeeze the behaviour that I require out of the 
>>>> load-balancer, then, depending on the request type I may be able to 
>>>> get away with a redirection with a changed session cookie or url 
>>>> param, or, failing this an http-proxy, across from a filter above 
>>>> the servlet on one side to the http-port on the node that owns the 
>>>> session...
>>>> The LB-integration object is pluggable and the aim is to supply 
>>>> wadi with a good selection of LB integrations - currently I only 
>>>> have a ModJK[2] plugin working. This is able to 'restick' clients 
>>>> to their session's new location (although messing with the session 
>>>> id is a little dodgy...).
>>>>>> I would be very grateful in any thoughts or feedback that you 
>>>>>> could give me. I hope to get much more information about WADI 
>>>>>> into the wiki over the next few weeks. That should help generate

>>>>>> more discussion, although I would be more than happy for people 
>>>>>> to ask me questions here on Geronimo-dev because this will give 
>>>>>> me an idea of what documentation I should write and how existing

>>>>>> documentation may be lacking or misleading.
>>>>> I guess my general comment would be that you might find it better 
>>>>> to think specifically about the end-user problem you are trying to 
>>>>> solve (say session replication) and work towards a solution based 
>>>>> on that. Most short-cuts / optimizations that vendors make are 
>>>>> specific to the problem domain and do not generally apply to all 
>>>>> clustering problems.
>>>> The end problem is really clustered web and ejb sessions at the 
>>>> moment, although it looks as if by the time we have solved these 
>>>> issues we may well have written a fault-tolerant 
>>>> distributed/partitioned index that might be very useful as a 
>>>> generic distributed cache building block.
>>>> One thing that I do want wadi to do, is to still work when 
>>>> replication is switched off. i.e., if a session only exists as a 
>>>> primary copy, even if affinity breaks down, wadi will continue to 
>>>> correctly render requests for that session unless some form of 
>>>> catastrophic failure causes the session to evaporate. This means 
>>>> that I need to ensure the session's timely evacuation from a node 
>>>> that chooses to leave the cluster to a remaining node, so that it 
>>>> may remain active beyond the lifetime of its original node. All of 
>>>> this must work flawlessly under stress, so that an admin may add or 
>>>> remove nodes to a running cluster without having to worry about the 
>>>> user state that it is managing. Nodes are added by simply starting 
>>>> them, and nodes removed via e.g. ctl-c-ing them.
>>>> If it is decided that a few more nines are needed in terms of 
>>>> session availability and the cluster owner understands the extra 
>>>> cost involved in in-vm replication in terms of extra hardware and 
>>>> bandwidth that they will have to purchase and is happy to go with 
>>>> in-vm-replication, then it should be sufficient to up the number of 
>>>> replicated copies kept by the cluster from '0' to e.g. '2' and 
>>>> restart (It might even be possible to vary this setting on a node 
>>>> to node basis so that this change does not even involve a complete 
>>>> cluster cold start). WADI should deal with the rest.
>>>> So, I believe that I have a pretty clear idea of what WADI will do, 
>>>> and aside from the replication stuff (phase2) it currently does 
>>>> most of what iIhad in mind for phase1, except that it is not yet 
>>>> happy under stress. I figure it will probably take one or two more 
>>>> redesign/reimplementation iterations to get it to this stage, then 
>>>> I can consider replication.
>>>> I have spoken to members of the OpenEJB team about  wadi's ability 
>>>> to relocate requests as well as sessions and we came to the 
>>>> conclusion that it was just as applicable in the EJB world as the 
>>>> web world. If the node an ejb client is talking to leaves the 
>>>> cluster in between calls, the client may try to contact it and then 
>>>> failover to another node that it hopes holds the session. If, due 
>>>> to other nodes leaving/joining it is not always clear which node 
>>>> will contain the session, the ability to reply to an RMI and just 
>>>> say "not here - there!" - i.e. an rmi redirection - would not be 
>>>> hard to add and would resolve this situation. Transactions are 
>>>> another item which I have marked phase2.
>>>> So, I am trying hard to stay very focussed on the problem domain, 
>>>> otherwise this will never get finished :-)
>>>> Right, off to read those papers now - thanks for your posting and 
>>>> your interest,
>>>> Jules
>>>>> Hope this helps
>>>>> andy 

Joe Bohn 
"He is no fool who gives what he cannot keep, to gain what he cannot lose."   -- Jim Elliot

View raw message