From: "Bhagwat, Hrishikesh"
To: "'geronimo-dev@incubator.apache.org'"
Subject: RE: Web State Replication... (long)
Date: Thu, 30 Oct 2003 17:11:47 -0800

hey Jules,

Happy to see your design .. we are now in a good position to compare notes. In fact Vivek and I did just that this morning, and here are some comments.

I would like to say that there are really 3 "independent" issues involved:

1. The issue of "CREATING and REORGANIZING" partitions, and of "NODE BACKUP".
2. The issue of session migration in the face of a failover.
3. The issue of integration with the LB (load balancer).

In summary, here is what I have to say:

So far as point (1) is concerned, I think your proposal is more of a "special case" scenario of the generic approach described in my proposal. Please see section "A" below for an explanation of this comment.

So far as point (2) is concerned, while reading through your proposal and comparing it with mine, we did realize the domino effect that a failover can have (in my design). We have come up with a way of executing LAZY-REDISTRIBUTION; see section "B".

So far as point (3) is concerned, because our designs are so similar, I think any good thoughts on this point can be incorporated into either design just as well.

Section "A"
============
Your design (I like to call it "I") appears to me to be a "special case" of the "auto-partitioning" design (call it "II") that I put forward earlier. In other words, design "I" can be DERIVED from design "II".
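Before the parameter walkthrough below, here is a rough sketch in Java of what I mean (PartitionPolicy, canGrow and everything else in it are names invented just for this mail; neither design actually defines such a class). The only point is that once design "II" is configured with the bounds used below, states (1) and (2) of design "I" fall out of it:

// Illustrative sketch only: this class and its names are made up for this
// example; neither proposal defines such an API.
public final class PartitionPolicy {

    private final int minCutOver; // partitions cannot have fewer than this many servers
    private final int maxCutOver; // partitions cannot have more than this many servers

    public PartitionPolicy(int minCutOver, int maxCutOver) {
        if (minCutOver < 1 || maxCutOver < minCutOver) {
            throw new IllegalArgumentException("bad partition bounds");
        }
        this.minCutOver = minCutOver;
        this.maxCutOver = maxCutOver;
    }

    /** A joining node is absorbed into the current partition while the upper bound allows it. */
    public boolean canGrow(int currentPartitionSize) {
        return currentPartitionSize < maxCutOver;
    }

    public static void main(String[] args) {
        PartitionPolicy policy = new PartitionPolicy(3, 4);
        // STATE-(1): RED, GREEN, YELLOW -> partition size 3, within bounds.
        // BLUE joins: canGrow(3) == true, so BLUE is absorbed -> STATE-(2), size 4.
        // A fifth node would see canGrow(4) == false and would start a new partition.
        System.out.println("absorb BLUE?  " + policy.canGrow(3)); // true
        System.out.println("absorb a 5th? " + policy.canGrow(4)); // false
    }
}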
Please consider this: if we implement the system using design "II" with the following set of values, then we arrive at STATE-(1) of design "I":

    PARAM
    -----
    Node List     : RED - GREEN - YELLOW
    cutover_count : 3 (current partition size)
    MAX_CUT_OVER  : 4 (partitions cannot have more than 4 servers)
    MIN_CUT_OVER  : 3 (partitions cannot have fewer than 3 servers)

Now when you add the BLUE node, the resultant STATE-(2) arises from having:

    Node List     : RED - GREEN - YELLOW - BLUE
    cutover_count : 4 (current partition size)
    MAX_CUT_OVER  : 4 (partitions cannot have more than 4 servers)
    MIN_CUT_OVER  : 3 (partitions cannot have fewer than 3 servers)

Thus, up to this point, "I" can be completely derived from "II". Further, due to its dynamic nature, "II" has the following distinctive advantage:

1. "The amount of processing each node of a partition must undertake in order to fully back up all nodes of that partition depends on the number of servers in that partition." In other words, each node of a partition that has just 3 servers will spend (relatively) less time doing backup activity than a node that is a member of a partition having 8 nodes. THE FEWER THE SERVERS IN A PARTITION, THE MORE EACH SERVER CAN CONCENTRATE ON SERVING CLIENT REQUESTS RATHER THAN ON BACKUP OPERATIONS, thus improving performance but losing robustness (fewer backups). MAX_CUT_OVER lets us specify the maximum number of servers allowed in a partition, thus letting you specify the LOWER BOUND ON PERFORMANCE. Similarly, MIN_CUT_OVER lets you put a LOWER bound on ROBUSTNESS. I therefore think we should stick to the approach where the system swings between the MIN and MAX values depending upon load.

Section "B"
============
The problem with the RECOVERY process in my initial proposal was that, on failure of a node A, its sessions would be sliced and each slice would be sent out to a specific node in the cluster. Up to this point I think all would be OK; problems would start when each node began to back up these newly received sessions onto other nodes, increasing the traffic many times over.

Vivek and I have thought of a solution for LAZY-REDISTRIBUTION. The idea is that, say, "A" fails and so any further requests for A are sent to B. As in the earlier proposal, B will service the request and simultaneously start a process to slice A's backup into (number of live servers) slices. It will then build a map of SLICE to SLICE_OWNER (node). Now when another of A's requests hits B, it will check the slice to which that session object belongs. If the slice it belongs to is owned by, say, D, then B will forward the request to D along with its slice of A's sessions. From here on, D takes up the responsibility of serving all SESSIONS that fall in the SLICE it was supplied with, and it will inform the LB about this. Once this is done, B no longer needs to hold D's slice in memory. Thus the process of distributing the SLICES is spread over a substantial period of time (a rough code sketch of this appears a little further down).

Finally, the reason I am keen on a cluster-wide distribution of sessions, as against a partition-wise distribution, is that once the distribution is done all nodes are again equal (in terms of load). We won't have the nodes of one partition being heavily loaded all the time (because one of their nodes went down) while the nodes of another partition idle away CPU cycles.
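Going back to the lazy hand-over described above, here is a rough sketch of how B might slice A's backed-up sessions and route requests (SliceMap, sliceFor, ownerOf and route are names invented for this mail; the hash-based slicing is just one possible way of cutting the sessions into "number of live servers" slices):

import java.util.List;

// Illustrative only: how node B might map a failed node's sessions to slice owners.
// The class and its method names are invented for this sketch.
public final class SliceMap {

    private final List<String> liveNodes;   // e.g. ["B", "C", "D"] once A has failed
    private final boolean[] handedOver;     // slice index -> already shipped to its owner?

    public SliceMap(List<String> liveNodes) {
        this.liveNodes = liveNodes;
        this.handedOver = new boolean[liveNodes.size()];
    }

    /** Hash the session id into one slice per live node. */
    public int sliceFor(String sessionId) {
        return Math.floorMod(sessionId.hashCode(), liveNodes.size());
    }

    /** The node that owns this slice once it has been handed over. */
    public String ownerOf(String sessionId) {
        return liveNodes.get(sliceFor(sessionId));
    }

    /**
     * Called on B when a request for one of A's sessions arrives. Returns the node that
     * should serve it; the first request for a foreign slice triggers the hand-over of
     * that whole slice, after which B can drop it from memory.
     */
    public String route(String sessionId, String self) {
        int slice = sliceFor(sessionId);
        String owner = liveNodes.get(slice);
        if (!owner.equals(self) && !handedOver[slice]) {
            handedOver[slice] = true;
            // here B would ship its copy of A's sessions in this slice to 'owner'
            // and tell the LB to re-stick those sessions to 'owner'
        }
        return owner;
    }
}

With live nodes [B, C, D], the first request that lands on B for a session in D's slice triggers the hand-over of that whole slice; every later request for that slice is simply routed to D, so the redistribution traffic is spread out over time exactly as described above.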
comments?

hb

-----Original Message-----
From: Jules Gosnell [mailto:jules@coredevelopers.net]
Sent: Thursday, October 30, 2003 3:10 AM
To: Geronimo Developers List
Subject: Web State Replication... (long)

I thought I would present my current design for state replication. Please see the enclosed diagram.

First, some terms:

n - the number of nodes in the cluster
s - a single copy of one node's state (a small coloured box)
b - the number of 'buddies' in a partition, and therefore the number of copies of each 's'

(1) n=3, b=3

State is replicated clockwise.
Red's state is backed up by Green and Yellow.
Green's state is backed up by Yellow and Red.
etc...

(2) introduction of a new node

n=4, b=3

Blue joins the cluster.
Yellow and Red take responsibility for backing up Blue's state (currently empty).
Blue takes responsibility for backing up Green's and Red's state.
A copy of Green's state 'migrates' from Red to Blue.
A copy of Red's state 'migrates' to Blue.
Since Blue entered with no state, the total migration cost is 2s.

(3) node loss

Blue leaves the cluster.
The lost copy of Red's state is replicated from Green to Yellow.
The lost copy of Green's state is replicated from Yellow to Red.
The lost copy of Blue's state is replicated from Red/Yellow to Green.
The total migration cost is 3s.

In summary:

a node entering the cluster costs (b-1)*s
a node leaving the cluster costs b*s

These costs are constant, regardless of the value of n (the number of nodes in the cluster). Breaking each node's state into more than the one piece suggested here and scattering those pieces more widely across a larger cluster will complicate the topology and not reduce these costs.

You'll note that in (3) each node is now carrying the state of 4, not 3, nodes and that, furthermore, if there were other nodes in the cluster, they would not be sharing this burden. The solution is to balance the state around the cluster. How/whether this can be done depends upon your load balancer, as discussed in a previous posting.

If you can explicitly tell your load balancer to 'unstick' a session from one node and 'restick' it to another, then you have a mechanism whereby you could split the state in the blue boxes into smaller pieces, migrate them equally to all nodes in the cluster, and notify the load balancer of their new locations (i.e. divide the contents of the blue boxes and subsume them into the R, G and Y boxes).

If you cannot coordinate with your load balancer like this, you may have to wait for it to try to deliver a request to Blue and fail over to another node. If you can predict which node this will be, you can proactively migrate the state to it. If you cannot predict where this will be, you have to wait for the request to arrive and then reactively migrate the state in underneath it, or else forward/redirect the request to the node that has the state. This piece of decision making needs to be abstracted into, e.g., a LoadBalancerPolicy class which can be plugged in to provide the correct behaviour for your deployment environment.

The passivation of aging sessions into a shared store can mitigate migration costs by reducing the number of active (in-vm) sessions that each node's state comprises, whilst maintaining their availability.

Migration should always be sourced from a non-primary node, since there will be less contention on backed-up state (which will only be receiving writes) than on primary state (which will be the subject of reads and writes). Furthermore, using the scheme I presented for lazy demarshalling, you can see that the cost of marshalling backed-up state will be significantly less than that of marshalling primary state, as potentially large object trees will already have been flattened into byte[].
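To put the ring above into code form, here is a minimal sketch of the clockwise buddy assignment and the constant join/leave costs (BuddyRing and its method names are invented for this illustration; this is not code from the design):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the clockwise 'buddy' topology described above.
// The class and its names are invented for this example.
public final class BuddyRing {

    private final List<String> nodes; // cluster members in ring order, e.g. RED, GREEN, YELLOW
    private final int b;              // copies of each node's state, including the primary

    public BuddyRing(List<String> nodes, int b) {
        this.nodes = nodes;
        this.b = b;
    }

    /** The (b - 1) nodes clockwise of 'node' that back up its state. */
    public List<String> buddiesOf(String node) {
        int i = nodes.indexOf(node);
        List<String> buddies = new ArrayList<>();
        for (int k = 1; k < b; k++) {
            buddies.add(nodes.get((i + k) % nodes.size()));
        }
        return buddies;
    }

    /** Copies of state that must move when an empty node joins: (b - 1) * s. */
    public int joinCost() {
        return b - 1;
    }

    /** Copies of state that must be re-replicated when a node leaves: b * s. */
    public int leaveCost() {
        return b;
    }
}

With nodes [RED, GREEN, YELLOW] and b = 3, buddiesOf("RED") returns [GREEN, YELLOW] as in (1); joinCost() and leaveCost() give the 2s of (2) and the 3s of (3), and stay the same however large n grows.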
The example above has only considered the web tier. If this is colocated with, e.g., an EJB tier, then the cost of migration goes up, since web state is likely to have associations with EJB state which are best maintained locally. This means that if you migrate web state you should consider migrating the related EJB state along with it. So a holistic approach which considers all tiers together is required. I hope that the model I am proposing here is sufficiently generic to be of use for the clustering of other tiers/services as well.

That's enough for now,

Jules

--
/*************************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * http://www.coredevelopers.net
 *************************************/