geronimo-dev mailing list archives

From: Jules Gosnell <ju...@coredevelopers.net>
Subject: Re: Web Clustering : Stick Sessions with NO SHARED STORE
Date: Thu, 30 Oct 2003 10:59:32 GMT
Bhagwat, Hrishikesh wrote:

>Ref: original proposal on AUTO-PARTITIONING
>
>just wanted to clarify that the back-up nodes (B and C) will not always
>unmarshall the sessions received from the Primary. The byte stream
>(containing all the marshalled sessions) from A would be saved into a file
>on a hard-disk (of B and C).
>
so you are replicating whole sessions - not deltas ?


Jules

> It is only if the Primary (A) goes down that the 
>secondary (B) will read the sessions and redistribute them amongst all servers
>(B-J). It is only after that that each of the servers will do the job of 
>unmarshalling the new sessions they have just received, and which they are
>each supposed to regard from now on as NATIVE.
>
>regardless, schemes for lazy de-serialization, such as the one mentioned below,
>would always be a big plus
>
>
>-----Original Message-----
>From: Jules Gosnell [mailto:jules@coredevelopers.net]
>Sent: Tuesday, October 28, 2003 10:16 AM
>To: geronimo-dev@incubator.apache.org
>Subject: Re: Web Clustering : Stick Sessions with NO SHARED STORE
>
>
>in the impl that I am working on I do this...
>
>every value in the session is wrapped by another object that controls 
>its marshalling/demarshalling.
>
>something like :
>
>class Wrapper
>{
>transient boolean _isActive;
>transient Object   _activeValue;
>byte[] _passiveValue;
>
>Object getValue(){...}
>}
>
>You construct the Wrapper with an Object which becomes its active value.
>
>When the Wrapper is serialised it does the following :
>
>- if _activeValue is an HttpSessionActivationListener, it calls 
>_activeValue.willPassivate()
>- if _activeValue is Serialisable it marshalls it into a byte[] and puts 
>this in _passiveValue
>- if _activeValue is a non-Serialisable type that J2EE mandates is 
>distributable (e.g. EJB) it sets _passiveValue to be, e.g., its handle 
>marshalled into a byte[]
>- it sets _isActive to 'false'
>
>when it is deserialised :
>
>- _isActive (transient) becomes 'false'
>- _activeValue (transient) becomes 'null'
>- _passiveValue is initialised to be a byte[]
>
>IFF Wrapper.getValue() is called, the process is reversed: the 
>_passiveValue is demarshalled into the _activeValue (or resolved via, e.g., 
>its handle), notified if it is a Listener etc., and _isActive is set to 'true'.
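>
>Fleshed out, those hooks might look roughly like this - an illustrative 
>sketch rather than the actual implementation (the marshall/demarshall 
>helpers are simplified, the owning session is not re-attached after 
>arrival, and the EJB-handle case is elided) :
>
>import java.io.*;
>import javax.servlet.http.*;
>
>public class Wrapper implements Serializable
>{
>  transient boolean _isActive;
>  transient Object _activeValue;
>  transient HttpSession _session; // owning session, needed for activation events
>  byte[] _passiveValue;
>
>  public Wrapper(Object value, HttpSession session)
>  { _activeValue=value; _isActive=true; _session=session; }
>
>  private void writeObject(ObjectOutputStream out) throws IOException
>  {
>    if (_isActive)
>    {
>      if (_activeValue instanceof HttpSessionActivationListener)
>        ((HttpSessionActivationListener)_activeValue)
>          .sessionWillPassivate(new HttpSessionEvent(_session));
>      _passiveValue=marshall(_activeValue); // non-Serialisable cases elided
>      _isActive=false;
>    }
>    out.defaultWriteObject(); // only _passiveValue is non-transient
>  }
>
>  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException
>  {
>    // arrives passive : just the byte[] ; in real code the owning session
>    // must be re-attached before activation
>    in.defaultReadObject();
>  }
>
>  public Object getValue() throws IOException, ClassNotFoundException
>  {
>    if (!_isActive)
>    {
>      _activeValue=demarshall(_passiveValue);
>      if (_activeValue instanceof HttpSessionActivationListener)
>        ((HttpSessionActivationListener)_activeValue)
>          .sessionDidActivate(new HttpSessionEvent(_session));
>      _isActive=true;
>    }
>    return _activeValue;
>  }
>
>  static byte[] marshall(Object o) throws IOException
>  {
>    ByteArrayOutputStream baos=new ByteArrayOutputStream();
>    new ObjectOutputStream(baos).writeObject(o);
>    return baos.toByteArray();
>  }
>
>  static Object demarshall(byte[] b) throws IOException, ClassNotFoundException
>  {
>    return new ObjectInputStream(new ByteArrayInputStream(b)).readObject();
>  }
>}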
>
>This gives you
>
>- lazy deserialisation - you read your incoming data into a Wrapper 
>containing a single byte[], and only pay further if you ask for the 
>value, which should never happen (with my design) on a non-primary node, 
>and needn't happen on the primary node, since you already have a 
>demarshalled value in your hand (you will never receive updates from 
>other nodes because you are the only one allowed to write).
>
>- the ability to distribute non-Serialisable objects
>
>- correct notification of HttpSessionActivationListeners
>
>- somewhere to put other stuff like optimistic version checking, 
>compression (if you were deployed in a narrowband environment)  etc...
>
>- all this is done transparently - simply serialising a whole session 
>will cause this to be done to every attribute.
>
>The cost is :
>
>- a Wrapper around each attribute (I could possibly make a runtime 
>decision as to whether a value needs wrapping - it would be a bit of a 
>waste of time around e.g. Integer)
>
>- sending an attribute becomes a little more expensive, since it is 
>marshalled once into the internal byte[] and then the Wrapper containing 
>this must again be marshalled.
>
>This is a possible solution, but I have yet to run any tests to 
>determine whether the saving on non-primary nodes will merit the extra 
>expense on the primary node. I figured that deserialisation would be 
>much more expensive than serialisation....
>
>It is possible, given a tight integration with the transport layer, that 
>I might be able to avoid this double serialisation on the primary node - 
>I am leaving this as a possible future optimisation.
>
>
>Jules
>
>Bhagwat, Hrishikesh wrote:
>
>>i agree we must avoid demarshalling every incoming object ... do you
>>have any suggestions on how we can do that ... 
>>
>>-----Original Message-----
>>From: Jules Gosnell [mailto:jules@coredevelopers.net]
>>Sent: Tuesday, October 28, 2003 12:03 AM
>>To: geronimo-dev@incubator.apache.org
>>Subject: Re: Web Clustering : Stick Sessions with NO SHARED STORE
>>
>>
>>Bhagwat, Hrishikesh wrote:
>>
>>>the impact of moving large number of session objects on the network 
>>>is a very obvious concern (expressed in the mail thread below)
>>>
>>if network bandwidth is the bottleneck, then compression is a possible 
>>solution. Bear in mind that you are trading cpu and memory requirements 
>>for bandwidth.
>>
>>demarshalling is an expensive thing to do and will affect backup nodes 
>>as well as target nodes in migration relationships.
>>
>>if a backup node can avoid demarshalling incoming objects, then this is 
>>a worthwhile optimisation.
>>
>>compression, which is used in JBoss(tm) SFSB replication, is a useful 
>>weapon in the arsenal.
>>
>>
>>Jules
>>
>>>>- Session migration will be expensive and should be avoided when 
>>>>possible. In cases where it is necessary, e.g. [multiple] node failure 
>>>>may lead to a node carrying more state than it can handle, it should 
>>>>probably progress as a low priority background task, so that the cluster 
>>>>is not suddenly flooded with messages which simply encourage more 
>>>>failures etc... and already stressed nodes are not further pegged out by 
>>>>the cost of [de]marshalling 1000s of sessions - this could quickly lead 
>>>>to a domino-effect bringing down the whole cluster (perhaps we could 
>>>>ringfence 'super-partitions' ?).
>>>>  
>>>here is a simple solution (of course not a very innovative one) ... whenever
>>>there is a need to send a really large number of session objects across, the
>>>System can use data-compression techniques. Here is a sample program i used
>>>to demonstrate the immense space/network-bandwidth saving it can give us.
>>>Just to give some stats ...
>>>
>>>I made a Vector of 1000 Strings, then made 500 such Vectors and added them all
>>>to a Hashtable. Then I used an ObjectOutputStream to write them to a file. The
>>>file size was 4,493 KB (4.4 MB) ... then i changed the code to zip the contents
>>>before writing them (see attachment) .. the file is now only 43 KB!!!
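>>>
>>>The attachment isn't reproduced here, but the zip-before-write idea is roughly
>>>the following (an illustrative sketch, not the original program - the data and
>>>file names are made up) :
>>>
>>>import java.io.*;
>>>import java.util.*;
>>>import java.util.zip.GZIPOutputStream;
>>>
>>>public class CompressedWrite
>>>{
>>>  public static void main(String[] args) throws IOException
>>>  {
>>>    Hashtable table = new Hashtable();
>>>    for (int i = 0; i < 500; i++)
>>>    {
>>>      Vector v = new Vector();
>>>      for (int j = 0; j < 1000; j++) v.add("some session data " + j);
>>>      table.put(new Integer(i), v);
>>>    }
>>>
>>>    // plain serialisation
>>>    ObjectOutputStream plain = new ObjectOutputStream(new FileOutputStream("sessions.ser"));
>>>    plain.writeObject(table);
>>>    plain.close();
>>>
>>>    // zipped serialisation - just wrap the stream in a GZIPOutputStream
>>>    ObjectOutputStream zipped = new ObjectOutputStream(
>>>        new GZIPOutputStream(new FileOutputStream("sessions.ser.gz")));
>>>    zipped.writeObject(table);
>>>    zipped.close();
>>>  }
>>>}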
>>>
>>>-----Original Message-----
>>>From: Jules Gosnell [mailto:jules@coredevelopers.net]
>>>Sent: Friday, October 24, 2003 6:25 PM
>>>To: geronimo-dev@incubator.apache.org
>>>Subject: Re: Web Clustering : Stick Sessions with NO SHARED STORE
>>>
>>>
>>>interesting - I'll digest it fully over the weekend...
>>>
>>>I spotted a couple of points on the first read though....
>>>
>>>- I don't think that there is very much difference between my and your 
>>>suggestions for auto-partitioning a cluster - it may well be possible to 
>>>make this a pluggable strategy, so that different criteria such as 
>>>'Internal Network traffic' could be used to make decisions.
>>>
>>>- If at all possible, the design should work with existing 
>>>load-balancers, such as the Cisco and F5 product lines, as well as Open 
>>>Source offerings such as mod_jk and mod_backhand. People may already 
>>>have well established architectures that they wish to fit Geronimo into. 
>>>A Geronimo-aware lb may be able to make optimisations that another lb 
>>>will not see, but the cluster should still perform reasonably with other 
>>>lbs.
>>>
>>>- Session migration will be expensive and should be avoided when 
>>>possible. In cases where it is necessary, e.g. [multiple] node failure 
>>>may lead to a node carrying more state than it can handle, it should 
>>>probably progress as a low priority background task, so that the cluster 
>>>is not suddenly flooded with messages which simply encourage more 
>>>failures etc... and already stressed nodes are not further pegged out by 
>>>the cost of [de]marshalling 1000s of sessions - this could quickly lead 
>>>to a domino-effect bringing down th whole cluster (perhaps we could 
>>>ringfence 'super-partions' ?).
>>>
>>>Whatever the final design, it needs to be as simple and as robust as it 
>>>possibly can be. Hybrid solutions, which take advantage of the strengths 
>>>of different approaches in exchange for a slight increase in complexity, 
>>>are often useful (hence my replication AND shared store approach).
>>>
>>>Whatever replication medium or shared store implementation we use, 
>>>someone will always come along and want to use something else - JGroups, 
>>>JMS, DB, JavaSpaces, NFS etc..., so these all need to be pluggable.
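>>>
>>>Just to make the pluggability point concrete, a minimal, made-up contract 
>>>(the names are illustrative, not an actual Geronimo interface) - JGroups, 
>>>JMS, DB, JavaSpaces or NFS backed implementations would all sit behind 
>>>something like :
>>>
>>>import java.io.IOException;
>>>
>>>public interface SessionStore
>>>{
>>>  void passivate(String id, byte[] marshalledSession) throws IOException;
>>>  byte[] activate(String id) throws IOException; // load (and remove) from the store
>>>  void remove(String id) throws IOException;
>>>}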
>>>
>>>
>>>I will try to find the time over the weekend to write up my suggestion 
>>>for auto-partitioning/healing and send it to the list. Then we can 
>>>compare notes :-)
>>>
>>>Jules
>>>
>>>
>>>Bhagwat, Hrishikesh wrote:
>>>
>>>>Attached is the PROTOCOL that can help us implement
>>>>Clustering without a DB. The PROTOCOL uses the concept
>>>>of dynamic partitioning (inspired by Jules Gosnell's 
>>>>original idea). Also, though it is not mentioned explicitly
>>>>in the document, many of the functions will operate using
>>>>JMX (thanks to Vivek Biswas :-) for enlightening me on this).
>>>>
>>>>Summary :
>>>>The protocol makes the system "dynamically change itself"
>>>>from a Jetty-like topology where "ALL SERVERS BACKING UP 
>>>>EACH OTHER" to a simple "PRIMARY - SECONDARY" topology
>>>>that contains only one backup per server. Thus as LOAD
>>>>increases partitions of ever diminishing sizes are created.
>>>>Since back-ups happen only within a partition the system
>>>>can scale well (also administrators can specify MAX_SERVERS_IN_CLUSTER
>>>>and MIN_SERVERS_IN_CLUSTER).
>>>>
>>>>The document is divided into 
>>>>
>>>>1. Concept
>>>>2. Case of Failover
>>>>3. Case of Node Added
>>>>4. Case of excessive load on a Server
>>>>
>>>>while the attachment has topics 1 and 2, i am still writing the 
>>>>others. I thought that topic 2 was particularly complex so i thought 
>>>>of putting it up on the mailing-list for all to review and
>>>>digest it till the relatively simple (i guess) topics 3 and 4 are ready
>>>>
>>>>thanks
>>>>-hb
>>>>
>>>>-----Original Message-----
>>>>From: Jules Gosnell [mailto:jules@coredevelopers.net]
>>>>Sent: Thursday, October 23, 2003 11:55 PM
>>>>To: Bhagwat, Hrishikesh; Geronimo Developers List
>>>>Cc: Biswas, Vivek
>>>>Subject: Re: [Re] Web Clustering : Stick Sessions with Shared Store -
>>>>current state of play.
>>>>
>>>>
>>>>I agree on needing to be able to run without a db.
>>>>
>>>>If you think about it, a db is just a highly specialised node in the 
>>>>cluster. It is much simpler, and therefore more maintainable, if all nodes in 
>>>>a Geronimo cluster are homogeneous. We can't lose the db your business 
>>>>data lives in, but if we can avoid adding another for session 
>>>>replication it might be an advantage.
>>>>
>>>>To this end, my design should work without a db. You just tune it so 
>>>>that passivation never occurs - i.e. unlimited active sessions. You 
>>>>trade off in-vm space against db space.
>>>>
>>>>Likewise, if you were an advocate of the 'shared store' approach, you 
>>>>should be able to constrain the web container to keep '0' active 
>>>>sessions in memory, but passivate everything immediately to the db.
>>>>
>>>>So, yes, I'm sure we are on the same page.
>>>>
>>>>
>>>>Jules
>>>>
>>>>Bhagwat, Hrishikesh wrote:
>>>>
>>>>>hi Jules,
>>>>>
>>>>>I am happy that we are converging, in that the following approach does very
>>>>>well cover some of the main goals that i was trying to achieve when i wrote my
>>>>>first proposal. I have marked those points below as [hb-X] with brief comments.
>>>>>
>>>>>Thus, keeping the "Goal of the Architecture" the same, I would like to propose
>>>>>a new scheme for "replication". While initially I was keen on having NO DATA
>>>>>EXCHANGE BETWEEN SERVERS, but rather on having them all just use the shared
>>>>>store to persist and retrieve session data, you were interested in the
>>>>>contrary. Your approach (mentioned in section 2 of
>>>>>http://wiki.codehaus.org/geronimo/Architecture/Clustering) is about Session
>>>>>object exchange between servers which are clustered and about clusters that
>>>>>are partitioned (statically/dynamically). I think the discussion there stops
>>>>>with some issues that you think may arise with that kind of a scheme.
>>>>>
>>>>>After Vivek Biswas first pointed it out to me, I have been feeling increasingly
>>>>>uncomfortable with the idea of Geronimo being DEPENDENT on an EXTERNAL system
>>>>>like a Database for implementing its web clustering. Some people on the mailing
>>>>>list have spoken about performance issues with the DB approach but i think with
>>>>>techniques like async-write-to-DB etc. such problems can be circumvented. Thus,
>>>>>though I still believe that it is a solution that can solve the problem, I am
>>>>>not comfortable with the usage of an external system to aid in clustering. I
>>>>>don't know of any GOOD (hi-av) DBs that come for FREE. This means that even if
>>>>>Geronimo is available for FREE it can't be used without purchasing a DB. The
>>>>>solution as a whole, is then, not truly FREE-of-cost.
>>>>>
>>>>>I have since been looking at many alternatives ... from the Jetty-implemented
>>>>>solution of having all machines in one cluster to a 1 Primary : 1 Secondary
>>>>>solution. 
>>>>>
>>>>>Presently I have something cooking. As I come close to working on the details
>>>>>I find it quite similar to your original solution of dynamic partitioning.
>>>>>This is what again convinces me that we seem to converge. My proposal is to
>>>>>exchange notes and to jointly come up with a detailed design.
>>>>>
>>>>>Do let me know your thoughts on that .... I am trying to complete a document
>>>>>on this new scheme ASAP and will mail you the same.
>>>>>
>>>>>thanks
>>>>>- hb
>>>>>
>>>>>-----Original Message-----
>>>>>From: Jules Gosnell [mailto:jules@coredevelopers.net]
>>>>>Sent: Thursday, October 23, 2003 3:44 AM
>>>>>To: geronimo-dev@incubator.apache.org
>>>>>Subject: Re: [Re] Web Clustering : Stick Sessions with Shared Store -
>>>>>current state of play.
>>>>>
>>>>>
>>>>>Guys,
>>>>>
>>>>>since this topic has come up again, I thought this would be a useful point
>>>>>to braindump my current ideas for comment and as a common point of 
>>>>>reference...
>>>>>
>>>>>Here goes :
>>>>>
>>>>>Each session has one 'primary' node and n 'replicant' nodes associated
>>>>>with it.
>>>>>
>>>>>Sticky load-balancing is a hard requirement.
>>>>>
>>>>>Changes to the session may only occur on the primary node.
>>>>>
>>>>>[hb-1] i always intended only the OWNER NODE (remember ... i wasn't keen on
>>>>>having the Primary/Sec. concept, so i am using this term) to make changes to
>>>>>the SESSION. Also, only ON CHANGE would the session be sent for persistence.
>>>>>
>>>>>
>>>>>Such changes are then replicated (possibly asynchronously, depending
>>>>>on data integrity requirements) to the replicant nodes.
>>>>>
>>>>>[hb-2] My approach was about NOT USING REPLICATION AT ALL but using 
>>>>>a shared data store. Though I never mentioned it explicitly (because I 
>>>>>regarded this as a finer detail) ... i always intended on having an 
>>>>>async system do the persistence.
>>>>>
>>>>>If, for any reason, a session is required to 'migrate' to another node
>>>>>(fail-over or clusterwide-state-balancing), this 'target' node makes a
>>>>>request to the cluster for this session, the current 'source' node
>>>>>handshakes and the migration ensues, after which the target node is
>>>>>promoted to primary status.
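>>>>>
>>>>>A toy sketch of the per-session bookkeeping this implies (the names are
>>>>>invented, not from any existing code) - one primary, n replicants, promotion
>>>>>on migration :
>>>>>
>>>>>import java.util.*;
>>>>>
>>>>>public class ClusteredSession
>>>>>{
>>>>>  final String _id;
>>>>>  String _primaryNode;        // only this node may write the session
>>>>>  final List _replicantNodes; // nodes holding passive copies
>>>>>
>>>>>  public ClusteredSession(String id, String primary, List replicants)
>>>>>  { _id=id; _primaryNode=primary; _replicantNodes=new ArrayList(replicants); }
>>>>>
>>>>>  // once the source node has handed over, the target becomes primary ;
>>>>>  // whether the old primary stays on as a replicant (state-balancing) or is
>>>>>  // simply gone (fail-over) is a policy decision
>>>>>  public void promote(String targetNode, boolean keepOldPrimaryAsReplicant)
>>>>>  {
>>>>>    _replicantNodes.remove(targetNode);
>>>>>    if (keepOldPrimaryAsReplicant && _primaryNode!=null) _replicantNodes.add(_primaryNode);
>>>>>    _primaryNode=targetNode;
>>>>>  }
>>>>>}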
>>>>>
>>>>>[hb-3] I am not sure how a target server can initiate a migration process.
>>>>>I would imagine that on a fail-over, or on a systematic removal of a node,
>>>>>or any other such action that requires cluster-wide state-balancing, it is
>>>>>the lb/adminS that would sense this first and initiate actions.
>>>>>
>>>>>Any inbound request landing on a node that is not primary for the
>>>>>required session results in a forward/redirect of the request to its
>>>>>current primary, or a migration of the session to the receiving node
>>>>>and its promotion to primary.
>>>>>
>>>>>[hb-4] Not sure when this scenario would occur - when would a "secondary"
>>>>>receive an HTTP request while the primary is functioning well?
>>>>>
>>>>>A shared store is used to passivate sessions that have been inactive
>>>>>for a given period, or are surplus to constraints on a node's session
>>>>>cache size.
>>>>>
>>>>>[hb-5] yes this is very much a point that I have been saying since my
>>>>>first proposal. Just to quote from that doc: "With a little intelligence
>>>>>built in, an MS can store away less busy sessions to DB and retrieve them
>>>>>when needed, thus offering something that is near to virtually unlimited
>>>>>amounts of sessions" (section 1.1.1.1)
>>>>>
>>>>>Once in the shared store, a session is disassociated from its primary
>>>>>and replicant nodes. Any node in the cluster, receiving a relevant
>>>>>request, may load the session, become its primary and choose
>>>>>replicant nodes for it.
>>>>>[hb-6] this is a good optimization to sit on top of the scheme mentioned in [hb-5]
>>>>>
>>>>>Correct tuning of this feature, in a situation where frequent
>>>>>migration is taking place, might cut this dramatically.
>>>>>
>>>>>
>>>>>The reason for the hard node-level session affinity requirement is to
>>>>>ensure maximum cache hits in e.g. the business tier. If a web session
>>>>>is interacting with cached resources that are not explicitly tied to
>>>>>it (and so could be associated with the same replicant nodes), the
>>>>>only way to ensure that subsequent uses of this session hit resources
>>>>>in these caches is to ensure that these occur on the same node as the
>>>>>cache - i.e. the session's primary node.
>>>>>
>>>>>By only having one node that can write to a session, we remove the
>>>>>possibility of concurrent writes occurring on different nodes and the
>>>>>subsequent complexity of deciding how to merge them.
>>>>>
>>>>>[hb-7] I completely agree
>>>>>
>>>>>The above strategy will work for a 'implicit-affinity' lb (e.g. BigIP),
>>>>>which remembers the last node that a session was successfully accessed
>>>>>on and rolls this value forward as and when it has to fail-over to a
>>>>>new node. We should be able to migrate sessions forward to the next
>>>>>node picked by the lb, underneath it, keeping the two in sync.
>>>>>
>>>>>With an 'explicit-affinity' lb (e.g. mod_jk), where routing info is
>>>>>actually encoded into the jsessionid/JSESSIONID value (or maybe an
>>>>>auxiliary path param or cookie), it should be possible, in the case of
>>>>>fail-over, to choose a (probably) replicant node to promote to primary
>>>>>and to stick requests falling elsewhere to this new primary by resetting
>>>>>this routing info on their jsessionid/JSESSIONID and redirecting/forwarding
>>>>>them to it.
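>>>>>
>>>>>As a rough illustration of that last point : mod_jk sticks on a worker-name
>>>>>suffix appended to the session id ("<id>.<jvmRoute>"), so rewriting the route
>>>>>might look something like this (illustrative only) :
>>>>>
>>>>>public class Reroute
>>>>>{
>>>>>  // swap the ".workerName" route suffix so the lb sticks the client to the
>>>>>  // newly promoted primary
>>>>>  static String reroute(String sessionId, String newWorker)
>>>>>  {
>>>>>    int dot = sessionId.lastIndexOf('.');
>>>>>    String bareId = (dot >= 0) ? sessionId.substring(0, dot) : sessionId;
>>>>>    return bareId + "." + newWorker;
>>>>>  }
>>>>>  // e.g. reroute("0A1B2C3D4E5F.node1", "node2") -> "0A1B2C3D4E5F.node2"
>>>>>}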
>>>>>
>>>>>If, in the future, we write/enhance an lb to be Geronimo-aware, we can
>>>>>be even smarter in the case of fail-over and just ask the cluster to
>>>>>choose a (probably) replicant node to promote to primary and then direct
>>>>>requests directly to this node.
>>>>>
>>>>>The cluster should dynamically inform the lb about joining/leaving
>>>>>nodes, and sessions should likewise maintain their primary/replicant
>>>>>lists accordingly.
>>>>>
>>>>>[hb-8] I completely agree
>>>>>
>>>>>LBs also need to be kept up to date with the locations and access
>>>>>points of the various webapps deployed around the cluster, relevant
>>>>>node and webapp stats (on which to base balancing decisions), etc...
>>>>>
>>>>>All of this information should be available to any member of the
>>>>>cluster and a Geronimo-aware lb should be a full cluster member.
>>>>>
>>>>>On shutting down every node in the cluster all session state should
>>>>>end up in the shared store.
>>>>>
>>>>>
>>>>>These are fairly broad brushstrokes, but they have been placed after
>>>>>some thought and outline the sort of picture that I would like to see.
>>>>>
>>>>>Your thoughts ?
>>>>>
>>>>>
>>>>>Jules
>>>>>
>>>>>Bhagwat, Hrishikesh wrote:
>>>>>
>>>>>>>I am also not convinced it reduces the amount of net traffic. After each
>>>>>>>request the MS must write to the shared store, which is the same traffic as
>>>>>>>a unicast write to another node or a multicast write to the partition
>>>>>>>(discounting the processing power needed to receive the message).
>>>>>>>
>>>>>>I agree. However, this is based on the assumption that only one unicast
>>>>>>write is required. In other words, this is a primary/secondary topology. I
>>>>>>think that hb did not intend such a topology, hence his statement.
>>>>>>
>>>>>>[hb] Yes, i was not assuming a Pri/Sec design but a layout where any active
>>>>>>server can be requested to pick up a client request which is destined for a
>>>>>>server that has just failed
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: gianny DAMOUR [mailto:gianny_damour@hotmail.com]
>>>>>>Sent: Sunday, October 19, 2003 7:35 AM
>>>>>>To: geronimo-dev@incubator.apache.org
>>>>>>Subject: [Re] Web Clustering : Stick Sessions with Shared Store
>>>>>>
>>>>>>
>>>>>>Jeremy Boynes wrote:
>>>>>>
>>>>>>>However, as Andy says, the cost of storing a serialized object in a BLOB is
>>>>>>>significant. Other forms of shared store are available though which may
>>>>>>>offer better performance (e.g. a hi-av NFS server).
>>>>>>>
>>>>>>Do we need a shared repository or a replicated repository?
>>>>>>
>>>>>>>The issue I have with hb's approach is the reliance on an Admin Server, of
>>>>>>>which there would need to be at least two and they would need to co-operate
>>>>>>>between themselves and with any load-balancers. I think this can be handled
>>>>>>>by the regular servers themselves just as efficiently.
>>>>>>>
>>>>>>I agree. It seems that in such a design an Admin Server is only used to
>>>>>>route incoming requests to the relevant node.
>>>>>>
>>>>>>However, I do not believe that regular servers can do this job. I assume
>>>>>>that they will implement a standard peer-to-peer cluster topology to provide
>>>>>>redundancies, however I do not see how they can handle the dispatch of
>>>>>>incoming requests.
>>>>>>
>>>>>>This feature seems to be either a client or a proxy concern: I mean it should
>>>>>>be done prior to reaching the nodes.
>>>>>>
>>>>>>For instance, this feature is treated on the client-side via a stub aware of
>>>>>>the available nodes in WebLogic. It seems that JBoss (correct me if I am
>>>>>>wrong) has also followed this design.
>>>>>>
>>>>>>>I am also not convinced it reduces the amount of net traffic. After each
>>>>>>>request the MS must write to the shared store, which is the same traffic as
>>>>>>>a unicast write to another node or a multicast write to the partition
>>>>>>>(discounting the processing power needed to receive the message).
>>>>>>>
>>>>>>I agree. However, this is based on the assumption that only one unicast
>>>>>>write is required. In other words, this is a primary/secondary topology. I
>>>>>>think that hb did not intend such a topology, hence his statement.
>>>>>>
>>>>>>Gianny
>>>>>>
>>>>>>
>>>>------------------------------------------------------------------------
>>>>
>>>>
>>>>Imagine a Cluster having N managed servers providing complete redundancy to
>>>>each other. Imagine that they are all arranged in a circle. 
>>>>
>>>>They continue to be one big cluster till the Internal Network traffic (INT)
>>>>is less than a certain threshold. At this stage any server in the cluster is
>>>>just as good as any other server in the cluster to provide fail-over support.
>>>>This scheme cannot however scale well, largely due to 2 factors:
>>>>
>>>>1. As client requests increase, the number of server-to-server exchanges for
>>>>Session replication would increase, stressing the network.
>>>>2. There would be an added overhead for each server to process ALL of these
>>>>Session objects.
>>>>
>>>>
>>>>                   A --------- B --------- C --------- D
>>>>                   |   ---->                           |
>>>>                   J                                   E
>>>>                   |                                   |
>>>>                   I --------- H --------- G --------- F
>>>>
>>>>
>>>>Here is a solution :
>>>>-------------------
>>>>After the INT crosses a certain threshold value, "A" stops sending Session-updates
>>>>to "the server before itself" (J). It also sends a notification to J to that effect.
>>>>J can now forget all the sessions of A that it had stored up until now and thus get
>>>>a chunk of free memory. Each server does the same, thus "B cuts off A" and "C cuts
>>>>off B" and so on. The number of exchanges now reduces from (n * n) to (n * (n-1)).
>>>>In case of failure "A" still has a solid backup (all servers from B to I).
>>>>
>>>>After some more increase in the traffic the servers PURGE yet another server.
>>>>Thus this time "A" will stop sending Session objects to I (and of course J).
>>>>Following that it will send a notification (YOU_ARE_PURGED) to I. On the other
>>>>hand "A" would have received a similar notification, this time, from C. However
>>>>A still has a strong enough backup (all servers from B to H) to fall back on.
>>>>Every machine holds with itself a list of ALL peers (A-B-C-... in the correct
>>>>order) and the current "cutoff-count" (no. of servers purged - this value is
>>>>obviously the same for all the servers).
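>>>>
>>>>A quick sketch of the bookkeeping this implies (purely illustrative - the names
>>>>are made up) : given the ordered peer list and the current cutoff-count, a
>>>>server's backup set is simply the next (N - 1 - cutoff_count) peers after it in
>>>>the ring.
>>>>
>>>>import java.util.*;
>>>>
>>>>public class Partition
>>>>{
>>>>  // peers must be the full, ordered ring, e.g. [A, B, ..., J]
>>>>  static List backupsFor(String owner, List peers, int cutoffCount)
>>>>  {
>>>>    int n = peers.size();
>>>>    int start = peers.indexOf(owner);
>>>>    int backups = n - 1 - cutoffCount;        // servers still receiving updates
>>>>    List result = new ArrayList();
>>>>    for (int i = 1; i <= backups; i++)
>>>>      result.add(peers.get((start + i) % n)); // walk forward round the ring
>>>>    return result;
>>>>  }
>>>>
>>>>  public static void main(String[] args)
>>>>  {
>>>>    List ring = Arrays.asList(new String[]{"A","B","C","D","E","F","G","H","I","J"});
>>>>    System.out.println(backupsFor("A", ring, 0)); // [B..J] - everyone backs A up
>>>>    System.out.println(backupsFor("A", ring, 1)); // [B..I] - J has been purged
>>>>    System.out.println(backupsFor("A", ring, 2)); // [B..H] - I and J purged
>>>>  }
>>>>}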
>>>>
>>>>
>>>>Note that, in a way, partitions are being created in the single large cluster.
>>>>"A" OWNS a partition that originally contained B-J. It is also a member of 9
>>>>other partitions owned respectively by B, C ... J. As traffic and client load
>>>>increase, A starts pushing out members of its partition, thus making it smaller.
>>>>In effect it gets removed from some other server's partition (say B's when
>>>>cutoff_count = 1, and C's when cutoff_count = 2).
>>>>
>>>>As soon as a server is PURGED (goes out of a partition): 
>>>>
>>>>1. It need no longer receive session-updates from the Partition OWNER - saving
>>>>network bandwidth
>>>>
>>>>2. It need not remember any Session objects that were earlier given to it by
>>>>THAT partition owner - thus allowing it to have more free memory, to be utilized
>>>>for serving the increased client load.
>>>>
>>>>
>>>>Under this scheme, I think, the Cluster will have a fairly uniform turn-around
>>>>time over a large range of "client load".
>>>>
>>>>Even as all of this happens, the LB will continue to allocate NEW CLIENTS in a
>>>>simple round-robin fashion while forwarding requests coming from all other
>>>>clients, who already have an established session with a server, to their
>>>>respective servers.
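>>>>
>>>>A toy sketch of that lb behaviour (not from the proposal - round-robin for new
>>>>clients, sticky otherwise, fail over to the next server in the list; it assumes
>>>>at least one server is up) :
>>>>
>>>>import java.util.*;
>>>>
>>>>public class ToyLoadBalancer
>>>>{
>>>>  final List servers;                       // ordered, e.g. [A, B, ..., J]
>>>>  final Map ownerBySession = new HashMap(); // sessionId -> owning server
>>>>  final Set down = new HashSet();           // servers known to have failed
>>>>  int rr = 0;
>>>>
>>>>  public ToyLoadBalancer(List servers) { this.servers = servers; }
>>>>
>>>>  public String route(String sessionId)
>>>>  {
>>>>    if (sessionId == null || !ownerBySession.containsKey(sessionId))
>>>>      return (String)servers.get(rr++ % servers.size()); // new client : round robin
>>>>    String owner = (String)ownerBySession.get(sessionId);
>>>>    while (down.contains(owner))                          // failed : next in the list
>>>>      owner = (String)servers.get((servers.indexOf(owner) + 1) % servers.size());
>>>>    return owner;
>>>>  }
>>>>}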
>>>>
>>>>
>>>>Now we shall discuss 4 special cases
>>>>
>>>>1. Failover 
>>>>
>>>>
>>>>Now imagine that "A" (actually any randomly selected server) fails. When the LB
>>>>finds out that it is not able to reach "A" it simply forwards the client request
>>>>(destined for A) to the next available server in the list, "B". B tries to serve
>>>>the request but doesn't find the SessionID in its list of "NATIVE" sessions.
>>>>THIS IS A SIGNAL TO "B" that "A" HAS failed. Now it goes on to do the following
>>>>CRUCIAL ACTIVITY :
>>>>
>>>>It goes to its local Bucket that was dedicated to holding A's Sessions. It will
>>>>scope out a collection of these for EACH SERVER IN THE CLUSTER (not just for the
>>>>ones in the partition) ... So even if "A" owned a partition having A-B-C, the
>>>>Session objects will still be divided into 8 (B-J) sets. The SETS are now sent
>>>>out to their respective owners. A special protocol (maybe a different
>>>>Queue/Topic) may be used to specify to the recipients that they are supposed to
>>>>HEREAFTER treat these Sessions as being native to them (as if they have now been
>>>>promoted to Primary Node for the Sessions they just received). While B is doing
>>>>this .. it may continue to get requests from A's clients (since the LB simply
>>>>forwards them). TILL SUCH TIME THAT B HAS NOT SENT OUT THE "SETS" IT WILL
>>>>CONTINUE TO SERVICE SUCH REQUESTS. ALSO, SUCH SESSIONS WOULD BE REMOVED FROM THE
>>>>SETS AND WOULD THEREAFTER BE TREATED AS NATIVE TO B. After B sends out the SETS,
>>>>it waits for acknowledgements from the servers. Before getting the ACK, if it
>>>>receives any more requests from A's clients, "B" will service that request and
>>>>send the updated Session to the owner (say F). If after a stipulated amount of
>>>>time "B" still doesn't receive an ACK (say from F), it will distribute F's SET
>>>>amongst the rest. For distribution it will again follow the same process. It
>>>>will send F a signal to IGNORE its MAKE_NATIVE call (through which it had
>>>>earlier handed over the Sessions).
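>>>>
>>>>A rough sketch of that redistribution step (illustrative only, not the
>>>>proposal's actual code - B splits the failed server's sessions into one SET per
>>>>surviving server, round-robin, and each SET is then sent with a MAKE_NATIVE
>>>>style message) :
>>>>
>>>>import java.util.*;
>>>>
>>>>public class FailoverDistributor
>>>>{
>>>>  static Map makeSets(List sessionIds, List survivors)
>>>>  {
>>>>    Map sets = new LinkedHashMap();               // server -> list of session ids
>>>>    for (Iterator i = survivors.iterator(); i.hasNext();)
>>>>      sets.put(i.next(), new ArrayList());
>>>>    int n = 0;
>>>>    for (Iterator i = sessionIds.iterator(); i.hasNext();)
>>>>      ((List)sets.get(survivors.get(n++ % survivors.size()))).add(i.next());
>>>>    return sets;
>>>>  }
>>>>}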
>>>>
>>>>When a server acknowledges, B sends an UPDATE_LIST (session -> new OWNER) to the
>>>>LB. The LB is from here on expected to send the requests to the new OWNERS. "B"
>>>>however continues to maintain this UPDATE_LIST with it. It will keep this list
>>>>until the LB ACKs having received the UPDATE_LIST. The reason to do so is the
>>>>following special case : "B" has sent the SETs and the UPDATE_LIST to the LB ...
>>>>but the LB has not yet received it. Just then a request appears from one of A's
>>>>clients. The LB still doesn't know about the new ownership of this SESSION by
>>>>another server ("D") and so continues with its policy of forwarding it to B. On
>>>>receipt, B finds out that it has received a call for A. It checks for the
>>>>correct owner of the session and forwards the call to it.
>>>>
>>>>By now all the machines have shared the load of A and can continue serving all
>>>>clients, so it is a good time for the system to re-organize itself. What this
>>>>means is that partitions IJA and JAB are not complete (because "A" has failed)
>>>>and partition ABC no longer exists. The demise of ABC doesn't really matter, but
>>>>for the first two partitions (IJA and JAB) it means that their owners,
>>>>respectively I and J, do not have enough backup servers. Instead of two backup
>>>>servers, as everyone else has, they have just one available. Thus the SYSTEM
>>>>needs to reorganize.
>>>>
>>>>In the current example we have said that the cutover_count is 3. The system can
>>>>take the new LIST OF SERVERS B-C-D-E-F-G-H-I-J and apply this cutover_count. BUT
>>>>since one server has gone down they would each have to service more load, so the
>>>>cutover_count is cut by ONE.
>>>>
>>>>
>>>>IMP NOTE: a system administrator can specify the MINIMUM and the MAXIMUM value
>>>>for cutover_count. The LOWER the CUTOVER_COUNT, the higher the performance (but
>>>>the fewer the servers backing up each server). On the other hand you may have
>>>>many servers backing up each server .. thus assuring higher failover capacity
>>>>but at lower performance.
>>>>
>>>>
>>>>After getting an ACK from all servers and all corresponding UPDATE-LIST-ACKs
>>>>from the LB, B will send out a RE_ORGANIZE signal with LIST as
>>>>B-C-D-E-F-G-H-I-J and 
>>>>
>>>>	cutover_count = (cutover_count > MIN_ALLOWED_CUTOVER) ? cutover_count - 1 : MIN_ALLOWED_CUTOVER
>>>>
>>>>to all the SERVERS
>>>>
>>>>NOTE : In this case, all throughout, B is acting as an ADMIN SERVER. But its
>>>>selection is DYNAMIC. If server C had gone down, "D" would have been the ADMIN.
>>>>This design thus does not have a STATICALLY CONFIGURED ADMIN on whose failure
>>>>the system cannot run.
>>>>
>>>>
>>>>ANOTHER NOTE : if "C" receives a request that was meant for "A" (and NOT B)
>>>>then that means that both A and B failed (almost simultaneously). C must then
>>>>execute the algo explained above for BOTH the servers.
>>>>
>>>>
>>>>How the Cluster re-organizes (actually re-partitions) we shall see in the next
>>>>section. However, one case needs to be explained before that. All the discussion
>>>>above pertains to the case where the failover of A was followed immediately by a
>>>>client request for A. The following is a discussion of how the system shall
>>>>behave for requests coming in for other servers.
>>>>
>>>>1.> B,C,D,E,F,G,H (Servers which do NOT have A as a part of partitions they
>>>>OWN) : Processing happens as normal. The capability to back them up is not
>>>>compromised by A going down.
>>>>
>>>>2.> I,J (Servers that DO CONTAIN A as a part of the partition they OWN) : After
>>>>processing the client request they shall find out, while trying to send the
>>>>updated SESSION to their backup servers, that one of their backup servers (A)
>>>>has failed. Since each server has a list of all peers (and in the correct order)
>>>>they can ask the server (B) following the one that failed (A) to take corrective
>>>>action. B will then execute the same process that is mentioned above, ending
>>>>eventually with a call to RE_ORG. It is only after RE_ORG that I and J will have
>>>>a second backup to save their SESSIONS. Meanwhile they will continue to be
>>>>backed up on the ONE backup server that is still available.
>>>>
>>>>	
>>>>Reorganization
>>>>--------------
>>>>
>>>>Reorganization happens at all Servers simultaneously. It is a simple process. A
>>>>server (say G) receives a RE_ORG signal from B. Along with this signal it
>>>>receives the new SERVER_LIST (B-C-D-E-F-G-H-I-J) and the new cutover_count (say
>>>>3-1 = 2). G will then undertake this two-step process:
>>>>
>>>>1. Hereafter, send all session updates (including those for the sessions it has
>>>>just received from B, i.e. A's sessions that B distributed to it) to ONLY its
>>>>new member(s). In this case "H".
>>>>
>>>>2. It will send a YOU_ARE_PURGED signal to all other servers (meaning that they
>>>>are no longer a part of G's partition).
>>>>
>>>>
>>>>This process happens at each server. Thus J will have I as backup and I will
have B.
>>>>
>>>>Note: If the MIN_ALLOWED_CUTOVER_COUNT is 3, B cannot send the new
>>>>cutover_count of 2. It will continue with the then-current value of 3; thus the
>>>>new partitions would be like GHI, HIJ, IJB, JBC and so on.
>>>>


-- 
/*************************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * http://www.coredevelopers.net
 *************************************/


