geronimo-dev mailing list archives

From Jules Gosnell <ju...@coredevelopers.net>
Subject Re: Clustering - JGroups issues and others
Date Wed, 19 Oct 2005 10:50:30 GMT
Thanks for coming back, Valeri.

You have put your finger fairly and squarely on the cluster 
implementer's nightmare :-)

This really is a thorny problem which I keep coming back to. I'm 
assuming that if the cluster becomes fragmented into different subgroups 
(that map to h/w enclosures etc.), and if they can all still see 
common backend services but not the other peer groups, then e.g. the h/w 
load-balancer in a web deployment may still be able to see all nodes in 
all groups? Since traffic is still arriving at more than one cluster 
fragment, all sorts of problems may arise.

I guess WADI might do something like this:

The cluster fragments...

Each fragment would find that it had an incomplete set of 
buckets/partitions (WADI's architecture is to partition the session 
space into a fixed number of buckets and share responsibility for these 
between the cluster members).
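
To make the partitioning concrete, a minimal Java sketch (hypothetical 
names and a made-up partition count - this is not actual WADI code) 
might look like:

    class PartitionMap {

        // fixed at deployment time; the real number is a tuning decision
        static final int NUM_PARTITIONS = 72;

        // which node currently owns each partition
        private final String[] owner = new String[NUM_PARTITIONS];

        // stable hash -> partition index, independent of current membership,
        // so every node agrees on which partition a session belongs to
        int partitionOf(String sessionId) {
            return Math.floorMod(sessionId.hashCode(), NUM_PARTITIONS);
        }

        String ownerOf(String sessionId) {
            return owner[partitionOf(sessionId)];
        }

        void assign(int partition, String nodeName) {
            owner[partition] = nodeName;
        }
    }

The important property is that the mapping from session id to partition 
never changes; only the mapping from partition to owning node does.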

Each fragment would have to assume that the missing partitions had been 
lost and would not be rejoining (in case this were really the case), so 
the missing partitions would have to be resurrected and repopulated with 
sessions drawn from replicated copies. Thus each fragment would end up 
with a complete set of partitions.
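
Roughly, the repair step each fragment would run might be sketched like 
this (hypothetical types and names, not WADI's API; any session with no 
surviving replica inside the fragment is simply lost):

    import java.util.*;

    class FragmentRepair {

        // ownerByPartition : partition index -> owning node, as seen after the split
        // localReplicas    : partition index -> (session id -> serialized session)
        void repairAfterSplit(Map<Integer, String> ownerByPartition,
                              Map<Integer, Map<String, byte[]>> localReplicas,
                              String localNode,
                              int numPartitions) {
            for (int p = 0; p < numPartitions; p++) {
                if (ownerByPartition.containsKey(p)) {
                    continue; // partition survived the split inside this fragment
                }
                // assume the lost owner is gone for good: resurrect the partition here
                ownerByPartition.put(p, localNode);
                // repopulate it from whatever replicated copies this fragment holds
                restorePartition(p, localReplicas.getOrDefault(p, Collections.emptyMap()));
            }
        }

        void restorePartition(int partition, Map<String, byte[]> sessions) {
            // deserialise and register the recovered sessions (omitted)
        }
    }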

Each fragment would be likely to end up with an incomplete session set 
that intersected with the session set held by other fragments (since it 
is likely that not all sessions could be resurrected, and some would be 
resurrected within more than one fragment).

Assuming (and I think we would have to make this a hard requirement) 
that the load-balancer supported session affinity correctly, requests 
would continue to be directed to the node holding the original (not 
resurrected) version of their session.

So, at this point, we have survived the fragmentation and we are still 
fully available to our clients, although there may have been quite a lag 
whilst partitions were rebuilt/repopulated, and the footprint of each 
node has probably increased, due to each fragment now carrying a larger 
proportion of the original cluster's sessions than it did before 
(the session sets intersect).

Then, the network comes back :-)

Each fragment would become aware of the other fragments. Multiple copies 
of partitions and sessions would now exist within the same cluster.

Multiple instances of the same partition can be merged by simply taking 
the union of the session sets that they manage.
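
In code the partition-level merge is trivial - something like this 
(hypothetical sketch):

    import java.util.*;

    class PartitionMerge {

        // collapse several instances of the same partition into one by taking
        // the union of the session ids they manage
        static Set<String> mergeInstances(List<Set<String>> instances) {
            Set<String> union = new HashSet<>();
            for (Set<String> sessionIds : instances) {
                union.addAll(sessionIds);
            }
            return union;
        }
    }

Session ids that appear in more than one instance are the ones that then 
need the per-session merge described below.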

Merging multiple instances of the same session is a bit more awkward. If 
sessions carried some sort of version (HttpSessions carry a 
LastAccessedTime field), then all instances with the same 'version' can 
be collapsed. I guess we then move on to a pluggable strategy of some 
sort. The simplest of these would probably just assume that only one 
session would have been involved in a dialogue with the client since the 
fracture, since the client was 'stuck' to its node. If this is the case, 
then sessions with the lower version will all be snapshots of the 
original session taken at the point of fracture, will not have 
diverged further, and so may be safely discarded (we may be able to try 
to remember/deduce the time of fracture and discard any session with a 
LAT before that point), leaving only the original session to continue. 
If divergence has occurred, then some custom, application-space code 
might be run that can use application-level knowledge to merge the 
various session versions. But I think that if we have got to this stage, 
then we are in real trouble and should probably just declare an error 
and drop the session.
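
A pluggable strategy along those lines might be sketched as follows 
(hypothetical interface; lastAccessedTime stands in for the 'version'):

    import java.util.*;

    class SessionCopy {
        final String id;
        final long lastAccessedTime; // used as the 'version'

        SessionCopy(String id, long lastAccessedTime) {
            this.id = id;
            this.lastAccessedTime = lastAccessedTime;
        }
    }

    interface SessionMergeStrategy {
        // returns the surviving copy, or null to declare an error and drop the session
        SessionCopy merge(List<SessionCopy> copies);
    }

    class LastAccessedTimeStrategy implements SessionMergeStrategy {

        public SessionCopy merge(List<SessionCopy> copies) {
            long newest = 0L;
            for (SessionCopy c : copies) {
                newest = Math.max(newest, c.lastAccessedTime);
            }
            List<SessionCopy> newestCopies = new ArrayList<>();
            for (SessionCopy c : copies) {
                if (c.lastAccessedTime == newest) {
                    newestCopies.add(c);
                }
            }
            // exactly one copy carried on talking to the client: the older ones
            // are stale snapshots taken at the fracture, so the newest copy wins
            if (newestCopies.size() == 1) {
                return newestCopies.get(0);
            }
            // all copies share the same version: they never diverged, pick any
            if (newestCopies.size() == copies.size()) {
                return newestCopies.get(0);
            }
            // several copies diverged past the fracture: give up and drop the
            // session (or hand off to application-level merge code)
            return null;
        }
    }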

None of this is yet implemented in WADI, but it is stuff that I 
dream/have-nightmares about when I get too geeky :-) I hope to put some 
of this functionality in at some point.


What sort of frequency might this type of scenario occur with? It will 
be a lot of work to protect against it, but I realise that a truly 
enterprise-level solution must be able to survive this sort of thing.

If anyone else has had thoughts about surviving cluster fragmentation, I 
would be delighted to hear them.



Jules



Valeri.Atamaniouk@nokia.com wrote:

>Hello there... 
>
>Answers/comments are down...
>
>-valeri
>
>  
>
>>-----Original Message-----
>>From: ext Jules Gosnell [mailto:jules@coredevelopers.net] 
>>Sent: 18 October, 2005 17:41
>>To: dev@geronimo.apache.org
>>Subject: Re: Clustering - JGroups issues and others
>>
>>Valeri.Atamaniouk@nokia.com wrote:
>>
>>>Hello
>>>
>>>Here is my 5 cents... I have some comments regarding clustering based
>>>on J-Groups. We were trying to use this technology and came to certain
>>>points that render it unusable in our case.
>>>
>>>Many of the cluster caches/replicators assume that all the information
>>>is propagated to all the nodes in the cluster. Some of the solutions
>>>propagate only keys, however. In any case this solution cannot be used
>>>in sufficiently large clusters, as the rate of updates would eat all
>>>the node capacity, making it unusable.
>>>
>>This is the dreaded 1->all replication that is a popular 
>>implementation at the moment. See my previous mail about 
>>wadi's avoidance of this giving it a significant advantage 
>>over such solutions, in terms of scalability.
>>
>>>Regarding J-Groups itself. Probably that is specific to the cluster
>>>facilities in JBoss, but generally J-Groups organizes a list of nodes,
>>>and every node checks the state of the next one in the chain.
>>>
>>I wasn't sure how it worked... interesting ...
>>We should look into how membership is tracked by ActiveCluster.
>>
>>>The problem is that in many cases servers may fail/disconnect in
>>>groups, which causes two problems: the segmentation of the cluster
>>>
>>cluster segmentation is a really tricky issue :-( - do all the segments
>>then try to arrange themselves into smaller clusters, shifting loads of
>>state around, or is jgroups smart enough to put all the pieces back
>>together before passing control back to the application ?
>>
>
>It is the problem of a "homogeneous" environment. Blade servers are
>naturally organized in chassis called enclosures (HP's term) or
>bladecenters (IBM's). All those chassis are interconnected with each
>other using one or more external switches. Due to the fact that larger
>solutions tend to use multiple VLANs, the failure of each of them can be
>independent from the others. So it can occur that a group of nodes
>will lose connectivity to all the rest of the cluster for some time (HA
>implied), but will see each other and also other backend services like
>the database etc.
>
>That effectively leads to a situation where instead of a single cluster
>we get two (three, four) smaller ones. If applications are distributing
>services automatically, the final result is a mess. Imagine a service
>that has to run on a single node, but starts on two or more...
>
>JGroups merges the groups back together after network recovery (note that
>JBoss sometimes doesn't - quite buggy), but the harm has been done
>already.
>
>>>and extremely high failure report time, as for architectures based on
>>>blade technology servers shut down in large packs
>>>
>>do these 'packs' correspond to racks ? I have plans (NYI) for pluggable
>>algorithms that will allow WADI to choose e.g. nodes in other racks, on
>>other power sources, in other buildings etc as replication partners,
>>otherwise you will lose state in a situation like this, if you happen to
>>have yours backed up on to the node next to you in the same rack...
>>
>
>Blade chassis (enclosures, bladecenters, etc). Also see above. The issue
>is that the network failover technologies have a certain reaction time:
>from a few seconds to minutes, depending on the technology used.
>
>>>and it really takes time to detect several sequentially disconnected
>>>servers.
>>>
>>What sort of lag are we talking about - a few seconds, or a few tens of
>>seconds ?
>>
>
>Up to a few minutes. For a cluster of 14 machines, when 10 of them powered
>down (through the management interface) it can take several minutes. The
>default transport there is TCP, which adds its own problems, as TCP
>timeouts are huge (especially in a wireless environment using wTCP
>settings ;-) ).
>
>>>To overcome the problems we ended up with the "star" architecture,
>>>where the central node is responsible for maintaining the list of
>>>other nodes. The availability of the central node itself could be
>>>provided with facilities like Red Hat Cluster Suite or similar
>>>(service failover, floating IPs, etc).
>>>
>>Hmmm.. - I understand why you went for this architecture, but I would
>>prefer to find one that is homogeneous - i.e. we don't need a special,
>>non-standard configuration for the central node. Deployment is much
>>easier if every node has the same configuration. Still, this is good
>>input and has got me thinking in a direction which I had not really
>>considered before.
>>
>
>I have no intention of forcing you to use our solution at all. Just
>some points for the cases when such a solution is not applicable.
>
>  
>
>>Thanks, Valeri,
>>
>>Jules
>>
>>    
>>
>>>-valeri
>>> 
>>>
>>>      
>>>
>>-- 
>>"Open Source is a self-assembling organism. You dangle a piece of
>>string into a super-saturated solution and a whole operating-system
>>crystallises out around it."
>>
>>/**********************************
>>* Jules Gosnell
>>* Partner
>>* Core Developers Network (Europe)
>>*
>>*    www.coredevelopers.net
>>*
>>* Open Source Training & Support.
>>**********************************/
>>
>>
>>    
>>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/

