From: "Bhagwat, Hrishikesh"
To: "'geronimo-dev@incubator.apache.org'"
Subject: RE: Web State Replication... (long)
Date: Thu, 30 Oct 2003 17:11:47 -0800

hey Jules,

Happy to see your design .. we are now in a good position to compare notes. In fact Vivek and I did just that this morning, and here are some comments.

I would like to say that there are really 3 "independent" issues involved:

1. The issue of "CREATING and REORGANIZING" partitions, and of "NODE BACKUP".
2. The issue of session migration in the face of a failover.
3. The issue of integration with the LB (load balancer).

In summary, here is what I have to say:

So far as point (1) is concerned, I think your proposal is more of a "special case" scenario of the generic approach described in my proposal. Please see section "A" below for an explanation of this comment.

So far as point (2) is concerned, while reading through your proposal and comparing it with mine, we did realize the domino effect that a failover can have (in my design). We have come up with a way of executing LAZY-REDISTRIBUTION; see section "B".

So far as point (3) is concerned, because our designs are so similar, I think any good thoughts on this point can be incorporated into either design just as well.

Section "A"
============
Your design (I like to call it "I") appears to me to be a "special case" of the "auto-partitioning" design (call it "II") that I put forward earlier. In other words, design "I" can be DERIVED from design "II".
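Before the parameter walkthrough below, here is a rough sketch in Java of what I mean (PartitionPolicy, canGrow and everything else in it are names invented just for this mail; neither design actually defines such a class). The only point is that once design "II" is configured with the bounds used below, states (1) and (2) of design "I" fall out of it:

// Illustrative sketch only: this class and its names are made up for this
// example; neither proposal defines such an API.
public final class PartitionPolicy {

    private final int minCutOver; // partitions cannot have fewer than this many servers
    private final int maxCutOver; // partitions cannot have more than this many servers

    public PartitionPolicy(int minCutOver, int maxCutOver) {
        if (minCutOver < 1 || maxCutOver < minCutOver) {
            throw new IllegalArgumentException("bad partition bounds");
        }
        this.minCutOver = minCutOver;
        this.maxCutOver = maxCutOver;
    }

    /** A joining node is absorbed into the current partition while the upper bound allows it. */
    public boolean canGrow(int currentPartitionSize) {
        return currentPartitionSize < maxCutOver;
    }

    public static void main(String[] args) {
        PartitionPolicy policy = new PartitionPolicy(3, 4);
        // STATE-(1): RED, GREEN, YELLOW -> partition size 3, within bounds.
        // BLUE joins: canGrow(3) == true, so BLUE is absorbed -> STATE-(2), size 4.
        // A fifth node would see canGrow(4) == false and would start a new partition.
        System.out.println("absorb BLUE?  " + policy.canGrow(3)); // true
        System.out.println("absorb a 5th? " + policy.canGrow(4)); // false
    }
}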
Please consider this: if we implement the system using design "II" with the following set of values, then we arrive at STATE-(1) of design "I":

    PARAM
    -----
    Node List     : RED - GREEN - YELLOW
    cutover_count : 3 (current partition size)
    MAX_CUT_OVER  : 4 (partitions cannot have more than 4 servers)
    MIN_CUT_OVER  : 3 (partitions cannot have fewer than 3 servers)

Now when you add the BLUE node, the resultant STATE-(2) arises from having:

    Node List     : RED - GREEN - YELLOW - BLUE
    cutover_count : 4 (current partition size)
    MAX_CUT_OVER  : 4 (partitions cannot have more than 4 servers)
    MIN_CUT_OVER  : 3 (partitions cannot have fewer than 3 servers)

Thus, up to this point, "I" can be completely derived from "II". Further, due to its dynamic nature, "II" has the following distinctive advantage:

1. "The amount of processing each node of a partition must undertake in order to fully back up all nodes of that partition depends on the number of servers in that partition." In other words, each node of a partition that has just 3 servers will spend (relatively) less time doing backup activity than a node that is a member of a partition having 8 nodes. THE FEWER THE SERVERS IN A PARTITION, THE MORE EACH SERVER CAN CONCENTRATE ON SERVING CLIENT REQUESTS RATHER THAN ON BACKUP OPERATIONS, thus improving performance but losing robustness (fewer backups). MAX_CUT_OVER lets us specify the maximum number of servers allowed in a partition, thus letting you specify the LOWER BOUND ON PERFORMANCE. Similarly, MIN_CUT_OVER lets you put a LOWER bound on ROBUSTNESS. I therefore think we should stick to the approach where the system swings between the MIN and MAX values depending upon load.

Section "B"
============
The problem with the RECOVERY process in my initial proposal was that, on failure of a node A, its sessions would be sliced and each slice would be sent out to a specific node in the cluster. Up to this point I think all would be OK; problems would start when each node began to back up these newly received sessions onto other nodes, increasing the traffic many times over.

Vivek and I have thought of a solution for LAZY-REDISTRIBUTION. The idea is that, say, "A" fails and so any further requests for A are sent to B. As in the earlier proposal, B will service the request and simultaneously start a process to slice A's backup into (number of live servers) slices. It will then build a map of SLICE to SLICE_OWNER (node). Now when another of A's requests hits B, it will check the slice to which that session object belongs. If the slice it belongs to is owned by, say, D, then B will forward the request to D along with its slice of A's sessions. From here on, D takes up the responsibility of serving all SESSIONS that fall in the SLICE it was supplied with, and it will inform the LB about this. Once this is done, B no longer needs to hold D's slice in memory. Thus the process of distributing the SLICES is spread over a substantial period of time (a rough code sketch of this appears a little further down).

Finally, the reason I am keen on a cluster-wide distribution of sessions, as against a partition-wise distribution, is that once the distribution is done all nodes are again equal (in terms of load). We won't have the nodes of one partition being heavily loaded all the time (because one of their nodes went down) while the nodes of another partition idle away CPU cycles.
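Going back to the lazy hand-over described above, here is a rough sketch of how B might slice A's backed-up sessions and route requests (SliceMap, sliceFor, ownerOf and route are names invented for this mail; the hash-based slicing is just one possible way of cutting the sessions into "number of live servers" slices):

import java.util.List;

// Illustrative only: how node B might map a failed node's sessions to slice owners.
// The class and its method names are invented for this sketch.
public final class SliceMap {

    private final List<String> liveNodes;   // e.g. ["B", "C", "D"] once A has failed
    private final boolean[] handedOver;     // slice index -> already shipped to its owner?

    public SliceMap(List<String> liveNodes) {
        this.liveNodes = liveNodes;
        this.handedOver = new boolean[liveNodes.size()];
    }

    /** Hash the session id into one slice per live node. */
    public int sliceFor(String sessionId) {
        return Math.floorMod(sessionId.hashCode(), liveNodes.size());
    }

    /** The node that owns this slice once it has been handed over. */
    public String ownerOf(String sessionId) {
        return liveNodes.get(sliceFor(sessionId));
    }

    /**
     * Called on B when a request for one of A's sessions arrives. Returns the node that
     * should serve it; the first request for a foreign slice triggers the hand-over of
     * that whole slice, after which B can drop it from memory.
     */
    public String route(String sessionId, String self) {
        int slice = sliceFor(sessionId);
        String owner = liveNodes.get(slice);
        if (!owner.equals(self) && !handedOver[slice]) {
            handedOver[slice] = true;
            // here B would ship its copy of A's sessions in this slice to 'owner'
            // and tell the LB to re-stick those sessions to 'owner'
        }
        return owner;
    }
}

With live nodes [B, C, D], the first request that lands on B for a session in D's slice triggers the hand-over of that whole slice; every later request for that slice is simply routed to D, so the redistribution traffic is spread out over time exactly as described above.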
comments?

hb

-----Original Message-----
From: Jules Gosnell [mailto:jules@coredevelopers.net]
Sent: Thursday, October 30, 2003 3:10 AM
To: Geronimo Developers List
Subject: Web State Replication... (long)

I thought I would present my current design for state replication. Please see the enclosed diagram.

First, some terms:

n - the number of nodes in the cluster
s - a single copy of one node's state (a small coloured box)
b - the number of 'buddies' in a partition, and therefore the number of copies of each 's'

(1) n=3, b=3

State is replicated clockwise.
Red's state is backed up by Green and Yellow.
Green's state is backed up by Yellow and Red.
etc...

(2) introduction of a new node

n=4, b=3

Blue joins the cluster.
Yellow and Red take responsibility for backing up Blue's state (currently empty).
Blue takes responsibility for backing up Green's and Red's state.
A copy of Green's state 'migrates' from Red to Blue.
A copy of Red's state 'migrates' to Blue.
Since Blue entered with no state, the total migration cost is 2s.

(3) node loss

Blue leaves the cluster.
The lost copy of Red's state is replicated from Green to Yellow.
The lost copy of Green's state is replicated from Yellow to Red.
The lost copy of Blue's state is replicated from Red/Yellow to Green.
The total migration cost is 3s.

In summary:

a node entering the cluster costs (b-1)*s
a node leaving the cluster costs b*s

These costs are constant, regardless of the value of n (the number of nodes in the cluster). Breaking each node's state into more than the one piece suggested here and scattering those pieces more widely across a larger cluster will complicate the topology and not reduce these costs.

You'll note that in (3) each node is now carrying the state of 4, not 3, nodes and that, furthermore, if there were other nodes in the cluster, they would not be sharing this burden. The solution is to balance the state around the cluster. How/whether this can be done depends upon your load balancer, as discussed in a previous posting.

If you can explicitly tell your load balancer to 'unstick' a session from one node and 'restick' it to another, then you have a mechanism whereby you could split the state in the blue boxes into smaller pieces, migrate them equally to all nodes in the cluster, and notify the load balancer of their new locations (i.e. divide the contents of the blue boxes and subsume them into the R, G and Y boxes).

If you cannot coordinate with your load balancer like this, you may have to wait for it to try to deliver a request to Blue and fail over to another node. If you can predict which node this will be, you can proactively migrate the state to it. If you cannot predict where this will be, you have to wait for the request to arrive and then reactively migrate the state in underneath it, or else forward/redirect the request to the node that has the state. This piece of decision making needs to be abstracted into, e.g., a LoadBalancerPolicy class which can be plugged in to provide the correct behaviour for your deployment environment.

The passivation of aging sessions into a shared store can mitigate migration costs by reducing the number of active (in-vm) sessions that each node's state comprises, whilst maintaining their availability.

Migration should always be sourced from a non-primary node, since there will be less contention on backed-up state (which will only be receiving writes) than on primary state (which will be the subject of reads and writes). Furthermore, using the scheme I presented for lazy demarshalling, you can see that the cost of marshalling backed-up state will be significantly less than that of marshalling primary state, as potentially large object trees will already have been flattened into byte[].
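To put the ring above into code form, here is a minimal sketch of the clockwise buddy assignment and the constant join/leave costs (BuddyRing and its method names are invented for this illustration; this is not code from the design):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the clockwise 'buddy' topology described above.
// The class and its names are invented for this example.
public final class BuddyRing {

    private final List<String> nodes; // cluster members in ring order, e.g. RED, GREEN, YELLOW
    private final int b;              // copies of each node's state, including the primary

    public BuddyRing(List<String> nodes, int b) {
        this.nodes = nodes;
        this.b = b;
    }

    /** The (b - 1) nodes clockwise of 'node' that back up its state. */
    public List<String> buddiesOf(String node) {
        int i = nodes.indexOf(node);
        List<String> buddies = new ArrayList<>();
        for (int k = 1; k < b; k++) {
            buddies.add(nodes.get((i + k) % nodes.size()));
        }
        return buddies;
    }

    /** Copies of state that must move when an empty node joins: (b - 1) * s. */
    public int joinCost() {
        return b - 1;
    }

    /** Copies of state that must be re-replicated when a node leaves: b * s. */
    public int leaveCost() {
        return b;
    }
}

With nodes [RED, GREEN, YELLOW] and b = 3, buddiesOf("RED") returns [GREEN, YELLOW] as in (1); joinCost() and leaveCost() give the 2s of (2) and the 3s of (3), and stay the same however large n grows.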
The example above has only considered the web tier. If this is colocated with, e.g., an EJB tier, then the cost of migration goes up, since web state is likely to have associations with EJB state which are best maintained locally. This means that if you migrate web state you should consider migrating the related EJB state along with it. So a holistic approach which considers all tiers together is required. I hope that the model I am proposing here is sufficiently generic to be of use for the clustering of other tiers/services as well.

That's enough for now,

Jules

--
/*************************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * http://www.coredevelopers.net
 *************************************/