From: Dain Sundstrom
To: geronimo-dev@incubator.apache.org
Subject: Re: Web State Replication... (long)
Date: Thu, 30 Oct 2003 12:06:16 -0600
In-Reply-To: <3FA14DBB.6020705@coredevelopers.net>

Jules,

IIRC, James' point is that having a lot more buckets than nodes makes adding and reorganizing state much easier. Of course, in the case of a failure you still have a bulk transfer of data, but that transfer is spread across the cluster. This helps avoid a domino-style cascade, where the first node dies, then its backup dies from the bulk-transfer load, and then that backup dies, and so on.

Anyway, I think the big benefit is the ease of redistributing sessions. Instead of a new node saying "I'll take these 3k sessions", it says "I'll take these three buckets". The load is much lower, but I think the biggest benefit is that the code should be easier to debug, understand and write.

It is not important now. As long as we keep the interface simple and clean, we can try many implementations until something fits.
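To make the bucket idea concrete, something like the following could work (just a sketch -- the class and method names are made up, not anything in the tree): hash each session id into a fixed, large number of buckets, keep a bucket-to-node table, and move whole buckets when membership changes.

// A minimal, hypothetical sketch -- not Geronimo code.  Hash each session
// id into a fixed, large number of buckets and move whole buckets (never
// individual sessions) when cluster membership changes.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BucketTable {

    private final int bucketCount;       // e.g. 1024 -- many more buckets than nodes
    private final String[] bucketOwner;  // bucket index -> owning node name

    public BucketTable(int bucketCount) {
        this.bucketCount = bucketCount;
        this.bucketOwner = new String[bucketCount];
    }

    /** Which bucket a session lives in; stable across membership changes. */
    public int bucketFor(String sessionId) {
        return (sessionId.hashCode() & 0x7fffffff) % bucketCount;
    }

    /** Which node currently owns the state for this session. */
    public String ownerOf(String sessionId) {
        return bucketOwner[bucketFor(sessionId)];
    }

    /**
     * Naive rebalance: deal the buckets out round-robin over the current
     * members.  A joining node ends up saying "I'll take these buckets"
     * instead of negotiating thousands of individual sessions.  A real
     * implementation would try to move as few buckets as possible.
     */
    public void rebalance(List<String> members) {
        for (int i = 0; i < bucketCount; i++) {
            bucketOwner[i] = members.get(i % members.size());
        }
    }

    public static void main(String[] args) {
        BucketTable table = new BucketTable(1024);
        List<String> members = new ArrayList<>(Arrays.asList("nodeA", "nodeB"));
        table.rebalance(members);
        System.out.println("owner of session xyz: " + table.ownerOf("xyz"));

        members.add("nodeC");      // a node joins...
        table.rebalance(members);  // ...and takes over whole buckets
        System.out.println("owner of session xyz: " + table.ownerOf("xyz"));
    }
}

The interesting part is that bucketFor() never changes when membership does, so only the bucket-to-node table has to move.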
-dain

On Thursday, October 30, 2003, at 11:43 AM, Jules Gosnell wrote:

> James Strachan wrote:
>
>> On Thursday, October 30, 2003, at 12:19 pm, gianny DAMOUR wrote:
>>
>>> Hello,
>>>
>>> Just a couple of questions regarding this design:
>>>
>>> - Is it possible to configure the weight of a node? If yes, is the same auto-partitioning policy applicable? My concern is that a "clockwise" policy may add a significant load on nodes hosted by low-spec hosts.
>>
>> This is partly a problem for the sticky load balancer to deal with, i.e. it should balance requests to primary machines based on spec/power.
>>
>> If we partitioned the session data into buckets (rather than one big lump), then the buckets of session data could be distributed evenly around the cluster so that each session bucket has N buddies (replicas), but a load-balancing algorithm could be used to distribute the buckets based on (say) a host-spec weighting or whatnot. E.g. nodes in the cluster could limit how many buckets they accept due to their lack of resources, etc.
>>
>> Imagine having 1 massive box and 2 small ones in a cluster - you'd probably want to give the big box more buckets than the smaller ones. The previous model Jules described still holds (that was a view of 1 session bucket) - it's just that the total session state for a machine might be spread over many buckets.
>>
>> Having multiple buckets could also help spread the load of recovering from a node failure in larger clusters.
>
> James, I have given this quite a bit of thought... and whilst it was initially appealing and seemed a sensible extension of my train of thought, I have not been able to find any advantage in splitting one node's state into multiple buckets.
>
> If a node joins or leaves, you still have exactly the same amount of state to shift around the cluster.
>
> If you back up your sessions off-node, then whether they are all on one backup node or spread over 10 makes no difference: in the first case, if you lose the backup node, you have to shift 100% x 1 node's state; in the second case, you have to shift 10% x 10 nodes' state (since the backup node will be carrying 10% of the state of another 9 nodes as well as your own). Initially it looks more resilient, but...
>
> So I am sticking, by virtue of Occam's razor, to the simpler approach for the moment, until someone can draw attention to a situation where the extra complexity of a higher-granularity replication strategy is worth the gain.
>
> Thinking about it, my current design is probably a hybrid - whilst a node's state is all held in a single bucket, individual sessions may be migrated out of that bucket and into another one on another node. So replication granularity is set at node level, but migration granularity is at session level. I guess you are suggesting that a bucket is somewhere between the two of these and is the level at which both replication and migration happen? I'll give it some more thought :-)
>
> Jules
>
>>> - I have the feeling that one cannot configure a preferred replication group for the primary sessions of a specific node: if four nodes are available, I would like to be able to configure that sessions of the first node should be replicated by the third node if available, or otherwise by the fourth one.
>>>
>>> - Is it not an overhead to have b-1 replicas? AFAIK, a single secondary should be enough.
>>
>> It all depends on your risk profile, I suppose. One backup is usually enough, but you may want 2 for extra resilience - especially as one of those could be in a separate DR zone for really serious fail-over scenarios.
>>
>> James
>> -------
>> http://radio.weblogs.com/0112098/
>
> --
> /*************************************
>  * Jules Gosnell
>  * Partner
>  * Core Developers Network (Europe)
>  * http://www.coredevelopers.net
>  *************************************/

/*************************
 * Dain Sundstrom
 * Partner
 * Core Developers Network
 *************************/
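As a rough illustration of James' weighting suggestion above (hypothetical code only, nothing in the tree): deal the buckets out in proportion to a per-node weight, so the big box ends up owning more buckets than the small ones.

// Hypothetical sketch only.  Splits a fixed bucket count across nodes in
// proportion to a per-node weight; the actual bucket-to-node assignment
// and any replica placement are left out.
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedBucketPlan {

    /** Returns nodeName -> number of buckets that node should own. */
    public static Map<String, Integer> plan(Map<String, Integer> weights, int bucketCount) {
        int totalWeight = weights.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Integer> counts = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            int share = bucketCount * e.getValue() / totalWeight;  // integer share
            counts.put(e.getKey(), share);
            assigned += share;
        }
        // Hand any rounding leftovers to the heaviest node (weights assumed non-empty).
        String heaviest = weights.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
        counts.merge(heaviest, bucketCount - assigned, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("bigBox", 4);     // 1 massive box...
        weights.put("smallBox1", 1);  // ...and 2 small ones
        weights.put("smallBox2", 1);
        System.out.println(plan(weights, 1024));  // {bigBox=684, smallBox1=170, smallBox2=170}
    }
}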