From: "Jeremy Boynes"
Subject: RE: [clustering] automatic partitioning, state bucketing, etc... (long)
Date: Wed, 20 Aug 2003 08:43:58 -0700

> I'm going to pick up this thread again :-)

We just can't leave it alone :-)

> we have to deal with both dumb and integrated load-balancers...
>
> DUMB LB:
>
> (A) undivided cluster, simple approach:
>
> every node buddies for every other node
> no 'locate/migrate' required since every session is on every node
> replication needs to be synchronous, in order to guarantee that node on
> which next request falls will be up-to-date
>
> problem: unscalable
>
> (B) subdivided cluster, more complex approach:
>
> cluster subdivided into buddy teams (possibly only of pairs).
> 'locate/migrate' required since request may fall on node that does not
> have session to hand
> primary could use async and secondary sync replication, provided that
> 'locate' always talked to primary

Sync and async are both options - sync may be needed for dumb clients (HTTP 1.0, or ones which overlap requests, e.g. for frames).

> problem: given a cluster of n nodes divided into teams of t nodes: only
> t/n requests will be able to avoid the 'locate/migrate' step - in a
> large cluster with small teams, this is not much more efficient than a
> shared store solution.

Conclusion: a DUMB LB is a bad choice in conjunction with a replication model and shared state. It is only recommended for use with stateless front ends.

> SMART LB (we're assuming it can do pretty much whatever we want it to).

We're assuming it is smart in that:

1) it can maintain session affinity (including for SSL if required)
2) it can detect failed nodes
3) it has (possibly configurable) policies for failing over if a node dies

I think that's all the capabilities it needs.

> (A)
>
> assuming affinity, we can use async replication, because request will
> always fall on most up to date node.

Async is a reliability/performance tradeoff - it introduces a window in which modified state has not been replicated and may be lost. Again, sync vs. async should be configurable.
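Just to make the sync/async point concrete, here is the kind of thing I have in mind. All of the names (SessionReplicator, Buddy, etc.) are made up for illustration - this is not Jetty or Geronimo code - but it shows the tradeoff: sync blocks the request until every buddy has acked, async returns immediately and accepts a window in which an unreplicated change is lost if the node dies.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch - not an existing Jetty or Geronimo API.
public class SessionReplicator {

    public interface Buddy {
        void send(String sessionId, byte[] state);   // blocking send + ack
    }

    private final boolean synchronous;                // configurable per webapp/cluster
    private final List<Buddy> buddies;                // this node's buddy team
    private final ExecutorService asyncPump = Executors.newSingleThreadExecutor();

    public SessionReplicator(boolean synchronous, List<Buddy> buddies) {
        this.synchronous = synchronous;
        this.buddies = buddies;
    }

    /** Called at the end of a request that modified the session. */
    public void replicate(String sessionId, byte[] state) {
        if (synchronous) {
            // Block the request thread until every buddy has acked,
            // so the next request can safely land on any buddy.
            push(sessionId, state);
        } else {
            // Return immediately; anything changed since the last successful
            // push is lost if this node dies inside the window.
            asyncPump.submit(() -> push(sessionId, state));
        }
    }

    private void push(String sessionId, byte[] state) {
        for (Buddy buddy : buddies) {
            buddy.send(sessionId, state);
        }
    }
}

Whether a given webapp runs sync or async then becomes pure configuration.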
> if this node fails, the lb MUST pick one to failover to and continue to
> use that one (or else we have to fall back to sync and assume dumb lb)
> if original node comes back up, it doesn't matter whether lb goes back
> to it, or remains stuck to fail-over node.

The premise is that the LB will pick a new node and keep affinity to it. How it picks the new node is undefined (depends on how the LB works) and may result in a locate/migrate step if it picks a node without the state. If the old node comes back, its copy of the state will be old and will trigger a locate/migrate step if the LB picks it (e.g. if it has a preferential affinity model).

> (B)
>
> if we can arrange for LB use affinity, with failover limited to our
> buddy-group, and always stick to the failover node as well we can lose
> 'locate/migrate' and replicate async. If we can't get 'always stick to
> failover node', we replicate sync after failover.

Again, async only works if you are willing to risk losing state.

> if we can only arrange affinity, but not failover within group, we can
> replicate async and need 'locate/migrate'. If we can't have
> lb-remains-stuck-to-failover-node, we are in trouble, because as soon as
> primary node fails we go back to the situation outlined above where we
> do a lot of locate/migrate and are not much better off than a
> shared store.

I don't get you on this one - maybe we have a different definition of affinity. Mine is that a request will always be directed back to the node that served the last one unless that node becomes unavailable. This means that a request goes to the last node that served it, not the one that originally created the session. Even if you have 'affinity to the node that created the session', you don't get a lot of locate/migrate - just a burst when the node comes back online.

> The lb-sticks-to-failover-node is not as simple as it sounds - mod_jk
> doesn't do it. :-(
>
> it implies
>
> either :
>
> you have the ability to change the routing info carried on the session
> id client side (I've considered this and don't think it practical - I
> may be wrong ...)

I'm dubious about this too - it feels wrong but I can't see what it breaks. I'm assuming that you set JSESSIONID to id.node, with node always being the last node that served it. The LB tries to direct the request to node, but if it is unavailable it picks another from its configuration. If the new node does not have the state then you do locate/migrate.

> or :
>
> the session id needs to carry not just a single piece of routing info
> (like a mod_jk worker name) but a failover list worker1,worker2,worker3
> etc in effect your buddy-team,

Again, this requires that the client handles changes to JSESSIONID OK (there's a rough sketch of such an id format below). It allows the nodes to determine the buddy group and would reduce the chance of locate/migrate being needed.

> or:
>
> the lb needs to maintain state, remembering where each session was last
> serviced and always sticking requests for that session to that node. in
> a large deployment this requires lbs to replicate this state between
> them so that they can balance over the same nodes in a coordinated
> fashion. I think F5 Big-IP is capable of this, but effectively you just
> shift the state problem from yourself to someone else.

Not quite - the LBs are sharing session-to-node affinity data, which is very small; the buddies are sharing session state, which is much larger. You are sharing the task, not shifting it. Yes, the LBs can do this.
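Here's the rough sketch of the failover-list-in-the-session-id idea mentioned above. The format and the class are invented for illustration (it is not mod_jk's real scheme, just the same trick extended to a list), and it assumes the client faithfully echoes back whatever cookie value we hand it.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical "id.worker1.worker2" style session id - invented for
// illustration, not an existing Jetty/Geronimo class.
public final class RoutableSessionId {

    private final String id;             // the real session id
    private final List<String> workers;  // buddy team, current primary first

    public RoutableSessionId(String id, List<String> workers) {
        this.id = id;
        this.workers = new ArrayList<>(workers);
    }

    /** Value to send in the JSESSIONID cookie, e.g. "ab12cd.worker1.worker2". */
    public String encode() {
        StringBuilder sb = new StringBuilder(id);
        for (String w : workers) {
            sb.append('.').append(w);
        }
        return sb.toString();
    }

    /** Parse an incoming cookie value back into id + buddy list. */
    public static RoutableSessionId decode(String cookieValue) {
        String[] parts = cookieValue.split("\\.");
        return new RoutableSessionId(parts[0],
                Arrays.asList(parts).subList(1, parts.length));
    }

    /** After a failover, move the node that just served the request to the front. */
    public RoutableSessionId withPrimary(String worker) {
        LinkedHashSet<String> reordered = new LinkedHashSet<>();
        reordered.add(worker);
        reordered.addAll(workers);
        return new RoutableSessionId(id, new ArrayList<>(reordered));
    }
}

The LB would just walk the suffix list in order until it finds a live worker, and the nodes can recover the buddy group from the same suffix.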
> Note that if your lb can understand extended routing info involving the
> whole buddy team, then you know that it will always balance requests to
> members of this team anyway, in which case you can dispense with
> 'locate/migrate' again.

It would be useful if the LB did this but I don't think it's a requirement. I don't think you can dispense with locate unless you are willing to lose sessions. For example, if the buddy group is originally (nodeA, nodeB) and both those nodes get cycled out, then the LB will not be able to find a node even if the cluster migrates the data to (nodeC, nodeD). When it sees the request come in and knows that A and B are unavailable, it will pick a random node, say nodeX, and X needs to be able to locate/migrate from C or D. This also saves the pre-emptive transfer of state from C back to A when A rejoins - it only happens if nodeA gets selected.

> Finally - you still need a migrate operation as sessions will need to
> migrate from buddy-group to buddy-group as buddy-groups are created and
> destroyed...
>
> in summary - I think that you can optimise away 'locate' and a lot of
> 'migrate'-ion - Jetty's current impl has no locate and you can build
> subdivided clusters with it and mod_jk.... but I don't do automatic
> repartitioning yet....

IIRC Jetty's current impl does it by replicating to all nodes in the partition, and I thought that's what you were trying to reduce :-)

The basic tradeoff is wide replication vs. locate-after-death. They both work; I just think locate-after-death results in less overhead during normal operation at the cost of more after a membership change, which seems preferable.

> If you are still reading here, then you are doing well :-)

Or you're just one sick puppy :-)

--
Jeremy
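P.S. For anyone who wants to picture the locate-after-death step above in code, here is a very rough sketch. The Cluster and Node interfaces are invented for illustration (they are not Jetty's or Geronimo's APIs); the point is just that the node which catches the request and has no state asks the whole cluster who holds it and pulls it over, which covers the case where (nodeA, nodeB) have been cycled out and the state now lives on (nodeC, nodeD).

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical locate/migrate sketch - invented interfaces, not real APIs.
public class SessionLocator {

    public interface Node {
        byte[] transfer(String sessionId);      // hand over the state and release it
    }

    public interface Cluster {
        Node whoHas(String sessionId);          // group-wide "who holds this session?"
    }

    private final Cluster cluster;
    private final Map<String, byte[]> localStore = new ConcurrentHashMap<>();

    public SessionLocator(Cluster cluster) {
        this.cluster = cluster;
    }

    /** Called when a request arrives for a session this node does not hold. */
    public byte[] locateAndMigrate(String sessionId) {
        byte[] state = localStore.get(sessionId);
        if (state != null) {
            return state;                        // we were already a buddy for it
        }
        Node owner = cluster.whoHas(sessionId);  // locate
        if (owner == null) {
            return null;                         // genuinely lost or expired
        }
        state = owner.transfer(sessionId);       // migrate: pull state from the owner
        localStore.put(sessionId, state);        // this node becomes the new primary
        return state;
    }
}

A pre-emptive push from C back to A when A rejoins then becomes unnecessary - the transfer only happens if and when A actually gets picked.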