geronimo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jules Gosnell <>
Subject Re: Clustering (long)
Date Tue, 02 Aug 2005 23:30:47 GMT

now I'm wondering about my solutions to (1) and (2) - if more than one 
node tries to join or leave at the same time I may be in trouble - so it 
may be safer to go straight to (3) for all cases...

more thought needed :-)


Jules Gosnell wrote:

> I've had a look at the Lampson paper, but didn't take it all in on the 
> first pass - I think it will need some serious concentration. The 
> Paxos algorithm looks interesting, I will definitely pursue this avenue.
> I've also given a little thought to exactly why I need a Coordinator 
> and how Paxos might be used to replace it. My use of a Coordinator and 
> plans for its future do not actually seem that far from Paxos, on a 
> preliminary reading.
> Given that WADI currently uses a distributed map of 
> sessionId:sessionLocation, that this distribution is achieved by 
> sharing out responsibility for the set number of buckets that comprise 
> the map roughly evenly between the cluster members and that this is 
> currently my most satisfying design, I can break my problem space (for 
> bucket arrangement) down into 3 basic cases :
> 1) Node joins
> 2) Node leaves in controlled fashion
> 3) Node dies
> If the node under discussion is the only cluster member, then no 
> bucket rearrangement is necessary - this node will either create or 
> destroy the full set of buckets. I'll leave this set of subcases as 
> trivial.
> 1)  The joining node will need to assume responsibility for a number 
> of buckets. If buckets-per-node is to be kept roughly the same for 
> every node, it is likely that the joining node will require transfer 
> of a small number of buckets from every current cluster member i.e. we 
> are starting a bucket rearrangement that will involve every cluster 
> member and only need be done if the join is successful. So, although 
> we wish to avoid an SPoF, if that SPoF turns out to be the joining 
> node, then I don't see it as a problem, If the node joining dies, then 
> we no longer have to worry about rearranging our buckets (unless we 
> have lost some that had already been transferred - see (3)). Thus the 
> joining node may be used as a single Coordinator/Leader for this 
> negotiation without fear of the SPoF problem. Are we on the same page 
> here ?
> 2) The same argument may be applied in reverse to a node leaving in a 
> controlled fashion. It will wish to evacuate its buckets roughly 
> equally to all remaining cluster members. If it shuts down cleanly, 
> this would form part of its shutdown protocol. If it dies before or 
> during the execution of this protocol then we are back at (3), if not, 
> then the SPoF issue may again be put to one side.
> 3) This is where things get tricky :-) Currently WADI has, for the 
> sake of simplicity, one single algorithm / thread / point-of-failure 
> which recalculates a complete bucket arrangement if it detects (1), 
> (2) or (3). It would be simple enough to offload the work done for (1) 
> and (2) to the node joining/leaving and this should reduce wadi's 
> current vulnerability, but we still need to deal with catastrophic 
> failure. Currently WADI rebuilds the missing buckets by querying the 
> cluster for the locations of any sessions that fall within them, but 
> it could equally carry a replicated backup and dust it off as part of 
> this procedure. It's just a trade-off between work done up front and 
> work done in exceptional circumstance... This is the place where the 
> Paxos algorithm may come in handy - bucet recomposition and 
> rearrangement. I need to give this further thought. For the immediate 
> future, however, I think WADI will stay with a single Coordinator in 
> this situation, which fails-over if 
> says it should - I'm delegating the really thorny problem to James 
> :-). I agree with you that this is an SPoF and that WADI's ability to 
> recover from failure here depends directly on how we decide if a node 
> is alive or dead - a very tricky thing to do.
> In conclusion then, I think that we have usefully identified a 
> weakness that will become more relevant as the rest of WADI's features 
> mature. The Lampson paper mentioned describes an algorithm for 
> allowing nodes to reach a consensus on actions to be performed, in a 
> redundant manner with no SPoF and I shall consider how this might 
> replace WADI's currently single Coordintor, whilst also looking at 
> performing other Coordination on joining/leaving nodes where its 
> failure, coinciding with that of its host node, will be irrelevant, 
> since the very condition that it was intended to resolve has ceased to 
> exist.
> How does that sound, Andy ? Do you agree with my thoughts on (1) & (2) 
> ? This is great input - thanks,
> Jules
> Jules Gosnell wrote:
>> Andy Piper wrote:
>>> Hi Jules
>>> At 05:37 AM 7/27/2005, Jules Gosnell wrote:
>>>> I agree on the SPoF thing - but I think you misunderstand my 
>>>> Coordinator arch. I do not have a single static Coordinator node, 
>>>> but a dynamic Coordinator role, into which a node may be elected. 
>>>> Thus every node is a potential Coordinator. If the elected 
>>>> Coordinator dies, another is immediately elected. The election 
>>>> strategy is pluggable, although it will probably end up being 
>>>> hardwired to "oldest-cluster-member". The reason behind this is 
>>>> that relaying out your cluster is much simpler if it is done in a 
>>>> single vm. I originally tried to do it in multiple vms, each taking 
>>>> responsibility for pieces of the cluster, but if the vms views are 
>>>> not completely in sync, things get very hairy, and completely in 
>>>> sync is an expensive thing to achieve - and would introduce a 
>>>> cluster-wide single point of contention. So I do it in a single vm, 
>>>> as fast as I can, with fail over, in case that vm evaporates. Does 
>>>> that sound better than the scenario that you had in mind ?
>>> This is exactly the "hard" computer science problem that you 
>>> shouldn't be trying to solve if at all possible. Its hard because 
>>> network partitions or hung processes (think GC) make it very easy 
>>> for your colleagues to think you are dead when you do not share that 
>>> view. The result is two processes who think they are the coordinator 
>>> and anarchy can ensue (commonly called split-brain syndrome). I can 
>>> point you at papers if you want, but I really suggest that you aim 
>>> for an implementation that is independent of a central coordinator. 
>>> Note that a central coordinator is necessary if you want to 
>>> implement a strongly-consistent in-memory database, but this is not 
>>> usually a requirement for session replication say.
>>> gives a good introduction to some of these things. I also presented 
>>> at JavaOne on related issues, you should be able to download the 
>>> presentation from at some point (not there yet - I 
>>> just checked).
>> OK - I will have a look at these papers and reconsider... perhaps I 
>> can come up with some sort of fractal algorithm which recursively 
>> breaks down the cluster into subclusters each of which is capable of 
>> doing likewise to itself and then  layout the buckets recursively via 
>> this metaphor... - this would be much more robust, as you point out, 
>> but, I think, a more complicated architecture. I will give it some 
>> serious thought. Have you any suggestions/papers as to how you might 
>> do something like this in a distributed manner, bearing in mind that 
>> as a node joins, some existing nodes will see it as having joined and 
>> some will not yet have noticed and vice-versa on leaving....
>>>> The Coordinator is not there to support session replication, but 
>>>> rather the management of the distributed map (map of which a few 
>>>> buckets live on each node) which is used by WADI to discover very 
>>>> efficiently whether a session exists and where it is located. This 
>>>> map must be rearranged, in the most efficient way possible, each 
>>>> time a node joins or leaves the cluster.
>>> Understood. Once you have a fault-tolerant singleton coordinator you 
>>> can solve lots of interesting problems, its just hard and often not 
>>> worth the effort or the expense (typical implementations involve HA 
>>> HW or an HA DB or at least 3 server processes).
>> Since I am only currently using the singleton coordinator for bucket 
>> arrangement, I may just live with it for the moment, in order to move 
>> forward, but make a note to replace it and start background threads 
>> on how that might be achieved...
>>>> Replication is NYI - but I'm running a few mental background 
>>>> threads that suggest that an extension to the index will mean that 
>>>> it associates the session's id not just to its current location, 
>>>> but also to the location of a number of replicants. I also have 
>>>> ideas on how a session might choose nodes into which it will place 
>>>> its replicants and how I can avoid the primary session copy ever 
>>>> being colocated with a replicant (potential SPoF - if you only have 
>>>> one replicant), etc...
>>> Right definitely something you want to avoid.
>>>> Yes, I can see that happening - I have an improvement (NYI) to 
>>>> WADI's evacuation strategy (how sessions are evacuated when a node 
>>>> wishes to leave). Each session will be evacuated to the node which 
>>>> owns the bucket into which its id hashes. This is because 
>>>> colocation of the session with the bucket allows many messages 
>>>> concered with its future destruction and relocation to be optimised 
>>>> away. Future requests falling elsewhere but needing this session 
>>>> should, in the most efficient case, be relocated to this same node, 
>>>> other wise the session may be relocated, but at a cost...
>>> How do you relocate the request? Many HW load-balancers do not 
>>> support this (or else it requires using proprietary APIs), so you 
>>> probably have to count on
>>> moving sessions in the normal failover case.
>> If I can squeeze the behaviour that I require out of the 
>> load-balancer, then, depending on the request type I may be able to 
>> get away with a redirection with a changed session cookie or url 
>> param, or, failing this an http-proxy, across from a filter above the 
>> servlet on one side to the http-port on the node that owns the 
>> session...
>> The LB-integration object is pluggable and the aim is to supply wadi 
>> with a good selection of LB integrations - currently I only have a 
>> ModJK[2] plugin working. This is able to 'restick' clients to their 
>> session's new location (although messing with the session id is a 
>> little dodgy...).
>>>> I would be very grateful in any thoughts or feedback that you could 
>>>> give me. I hope to get much more information about WADI into the 
>>>> wiki over the next few weeks. That should help generate more 
>>>> discussion, although I would be more than happy for people to ask 
>>>> me questions here on Geronimo-dev because this will give me an idea 
>>>> of what documentation I should write and how existing documentation 
>>>> may be lacking or misleading.
>>> I guess my general comment would be that you might find it better to 
>>> think specifically about the end-user problem you are trying to 
>>> solve (say session replication) and work towards a solution based on 
>>> that. Most short-cuts / optimizations that vendors make are specific 
>>> to the problem domain and do not generally apply to all clustering 
>>> problems.
>> The end problem is really clustered web and ejb sessions at the 
>> moment, although it looks as if by the time we have solved these 
>> issues we may well have written a fault-tolerant 
>> distributed/partitioned index that might be very useful as a generic 
>> distributed cache building block.
>> One thing that I do want wadi to do, is to still work when 
>> replication is switched off. i.e., if a session only exists as a 
>> primary copy, even if affinity breaks down, wadi will continue to 
>> correctly render requests for that session unless some form of 
>> catastrophic failure causes the session to evaporate. This means that 
>> I need to ensure the session's timely evacuation from a node that 
>> chooses to leave the cluster to a remaining node, so that it may 
>> remain active beyond the lifetime of its original node. All of this 
>> must work flawlessly under stress, so that an admin may add or remove 
>> nodes to a running cluster without having to worry about the user 
>> state that it is managing. Nodes are added by simply starting them, 
>> and nodes removed via e.g. ctl-c-ing them.
>> If it is decided that a few more nines are needed in terms of session 
>> availability and the cluster owner understands the extra cost 
>> involved in in-vm replication in terms of extra hardware and 
>> bandwidth that they will have to purchase and is happy to go with 
>> in-vm-replication, then it should be sufficient to up the number of 
>> replicated copies kept by the cluster from '0' to e.g. '2' and 
>> restart (It might even be possible to vary this setting on a node to 
>> node basis so that this change does not even involve a complete 
>> cluster cold start). WADI should deal with the rest.
>> So, I believe that I have a pretty clear idea of what WADI will do, 
>> and aside from the replication stuff (phase2) it currently does most 
>> of what iIhad in mind for phase1, except that it is not yet happy 
>> under stress. I figure it will probably take one or two more 
>> redesign/reimplementation iterations to get it to this stage, then I 
>> can consider replication.
>> I have spoken to members of the OpenEJB team about  wadi's ability to 
>> relocate requests as well as sessions and we came to the conclusion 
>> that it was just as applicable in the EJB world as the web world. If 
>> the node an ejb client is talking to leaves the cluster in between 
>> calls, the client may try to contact it and then failover to another 
>> node that it hopes holds the session. If, due to other nodes 
>> leaving/joining it is not always clear which node will contain the 
>> session, the ability to reply to an RMI and just say "not here - 
>> there!" - i.e. an rmi redirection - would not be hard to add and 
>> would resolve this situation. Transactions are another item which I 
>> have marked phase2.
>> So, I am trying hard to stay very focussed on the problem domain, 
>> otherwise this will never get finished :-)
>> Right, off to read those papers now - thanks for your posting and 
>> your interest,
>> Jules
>>> Hope this helps
>>> andy 

"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * Open Source Training & Support.

View raw message