geronimo-dev mailing list archives

From Robert Stupp <>
Subject Re: [clustering] automatic partitioning, state bucketing, etc... (long)
Date Fri, 15 Aug 2003 08:27:37 GMT


Heavy stuff ;)

Your thoughts are great.

It is not too early to think about load-balancing and fail-over. The
whole application server infrastructure has to be prepared for it from
the start - both a concept and a "raw" API should already exist at this
early stage.

Using "real" fail-over requires other components (e.g. database
repositories) to be fail-safe, too (Oracle clusters for example).

Jules Gosnell wrote:
| I know this is a little early, but you don't have to read it :-)
| I've been playing around with some ideas about [web] clustering and
| distribution of session state that I would like to share...
| My last iteration on this theme (Jetty's distributable session
| manager) left me with an implementation in which every node in the
| cluster held a backup copy of every other node's state.
| This is fine for small clusters, but the larger the cluster and the
| total state, the less it scales, because every node needs more memory
| and more bandwidth to hold these backups and keep them up to date.
| The standard way around this is to 'partition' your cluster,
| i.e. break it up into lots of subclusters which are small enough not
| to display this problem. This complicates load-balancing (as the
| load-balancer needs to know which nodes belong to which partitions,
| i.e. carry which state). Furthermore, in my implementation,
| the partitions needed to be configured statically, which is a burden
| in terms of time and understanding on the administrator and means that
| their layout will probably end up being suboptimal or, even worse,
| broken. You also need more redundancy, because a static configuration
| cannot react to runtime events (say, by repartitioning when a node
| goes down), so you would be tempted to put more nodes in each
| partition, etc...
| So, I am now thinking about dynamically 'auto-partitioning' clusters
| with coordinated load-balancing.
| The idea is to build a reliable underlying substrate of partitions
| which reorganise dynamically as nodes leave and join the cluster.
| I have something up and running which does this OK - basically, nodes
| are arranged in a circle. Each node, together with the n-1 nodes
| directly before it (for whatever partition size n you want), forms a
| partition. Partitions therefore overlap around the clock. In fact,
| every node will belong to the same number of partitions as there are
| members in each one.
| As a node joins the cluster, the partitions are rearranged to
| accommodate it whilst maintaining this strategy. It transpires that
| the n-1 (where n is the number of nodes per partition) nodes directly
| behind the new arrival have to pass ownership of an existing partition
| membership to the new node, and a new partition has to be created
| encompassing these nodes and the new one. So for each new node a total
| of n-1 partition releases and n partition acquisitions occurs (see the
| first sketch after the quoted message).
| As a node leaves the cluster, exactly the same happens in reverse.
| This applies until the total number of nodes in the cluster falls to
| n or below, at which point you can optimise the number of partitions
| down to 1.
| I'm happy (currently) with this arrangement as a robust underlying
| platform for state replication. The question is whether the
| 'buckets' that state is held in should map directly to these
| partitions (i.e. each partition maintains a single bucket), or whether
| their granularity should be finer.
| I'm still thinking about that at the moment, but thought I would throw
| my thoughts so far into the ring and see how they go..
| If you map 1 bucket:1 partition, you will end up with a cluster that
| is very unevenly balanced in terms of state. Every time a node joins,
| although it will acquire n-1 'buddies' and copies of their state, you
| won't be able to balance up the existing state across the cluster,
| because to give the new node some state of its own you would need to
| take it from another partition - and since the mapping is 1:1, the
| smallest unit of state that you can transfer from one partition to
| another is its entire contents/bucket.
| The other extreme is 1 bucket:1 session (assuming we are talking
| session state here) - but then why have buckets at all?
| My next thought was that the number of buckets should be the square of
| the number of nodes in the cluster. That way, every node carries n
| buckets (where n is the total number of nodes in the cluster). When
| you add a new node you can transfer a bucket from each existing node
| to it. Each other node creates two empty buckets and the new node
| creates one (this arithmetic is checked in the second sketch after the
| quoted message). State is nicely balanced etc... However, this means
| that the number of buckets grows quadratically, which smacks of
| something unscalable to me and was exactly the reason that I added
| partitions. In the web world, I shall probably have to map a bucket to
| a mod_jk worker, for example (this entails appending the name of the
| bucket to the session-id and telling Apache/mod_jk to associate a
| particular host:port with this name - see the third sketch after the
| quoted message. I can alter the node on which the bucket is held by
| informing Apache/mod_jk that this mapping has changed. Moving a
| session from one bucket to another is problematic and probably
| impossible, as I would have to change the session-id that the client
| holds. Whilst I could do this if I waited for a request from a
| single-threaded client, with a multi-threaded client it becomes much
| more difficult...). If we had 25 nodes in the cluster, this would mean
| 625 Apache/mod_jk workers... I'm not sure that mod_jk would like this
| :-).
| So, I am currently thinking about either a fixed number of buckets
| per node, so that the total number of buckets grows only linearly
| with the size of the cluster, or keeping the 1:1 mapping and living
| with the inability to immediately balance existing state, instead
| dynamically altering the node weightings in the load-balancer
| (mod_jk) so that the cluster soon evens out again...
| I guess that's enough for now....
| If this is an area that interests you, please give me your thoughts
| and we can massage them all together :-)
| Jules
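
A minimal Java sketch of the ring arrangement described above, assuming
a partition is a node plus its n-1 predecessors on the circle (the
class and method names are illustrative only, not any real
Jetty/Geronimo API):

    import java.util.*;

    // Hypothetical model of the overlapping ring partitions sketched above.
    public class RingPartitions {

        // Partition anchored at 'anchor': the anchor plus its n-1
        // predecessors on the ring (indices wrap around the circle).
        static List<Integer> partitionMembers(int anchor, int clusterSize, int n) {
            List<Integer> members = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                members.add(((anchor - i) % clusterSize + clusterSize) % clusterSize);
            }
            return members;
        }

        public static void main(String[] args) {
            int clusterSize = 6; // total nodes on the ring
            int n = 3;           // nodes per partition

            // Every node anchors one partition, so partitions overlap
            // 'around the clock' and each node belongs to n of them.
            Map<Integer, Integer> memberships = new HashMap<>();
            for (int anchor = 0; anchor < clusterSize; anchor++) {
                List<Integer> p = partitionMembers(anchor, clusterSize, n);
                System.out.println("partition@" + anchor + " = " + p);
                for (int node : p) memberships.merge(node, 1, Integer::sum);
            }
            System.out.println("memberships per node: " + memberships); // all == n

            // On a join, the n-1 nodes directly behind the newcomer each
            // release one membership; the newcomer acquires n (the n-1
            // handed over plus the new partition it anchors itself).
            System.out.println("join cost: " + (n - 1) + " releases, " + n + " acquisitions");
        }
    }

Running it with 6 nodes and n=3 prints six overlapping partitions and
confirms that every node holds exactly n memberships, matching the
n-1 release / n acquisition accounting in the quoted message.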
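The bucket arithmetic for the "square of the cluster size" scheme can
be checked mechanically - a second small sketch, again hypothetical,
simply replaying the numbers from the quoted paragraph:

    // Replays the bucket arithmetic for the n-squared scheme: growing a
    // cluster from n to n+1 nodes while keeping buckets = nodes^2.
    public class SquareBuckets {
        public static void main(String[] args) {
            for (int n = 1; n <= 25; n++) {
                int before       = n * n;  // n nodes, n buckets each
                int createdByOld = 2 * n;  // each existing node creates two empty buckets
                int createdByNew = 1;      // the newcomer creates one
                int after        = before + createdByOld + createdByNew;
                if (after != (n + 1) * (n + 1)) {
                    throw new AssertionError("arithmetic broken at n=" + n);
                }
                // The newcomer also receives one transferred bucket from each
                // of the n existing nodes, ending with n + 1 buckets - the
                // same as every existing node (n - 1 kept + 2 created).
            }
            // Quadratic growth: 25 nodes already means 625 buckets/workers.
            System.out.println("25 nodes -> " + (25 * 25) + " buckets");
        }
    }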
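The bucket-name-in-the-session-id trick is essentially how mod_jk's
route suffix works: the worker (route) name rides after a dot at the
end of the session id, and mod_jk strips it off to pick the worker. A
third, hedged sketch, with helper names of my own invention:

    // Sketch of pinning a session to a bucket via a mod_jk-style suffix.
    // The 'id.route' format follows the jvmRoute convention; tagSessionId
    // and bucketOf are illustrative helpers, not an existing API.
    public class BucketRouting {

        // Append the bucket name so mod_jk routes requests to its worker.
        static String tagSessionId(String rawId, String bucket) {
            return rawId + "." + bucket;
        }

        // Recover the bucket (worker) name from an incoming session id.
        static String bucketOf(String sessionId) {
            int dot = sessionId.lastIndexOf('.');
            return dot < 0 ? null : sessionId.substring(dot + 1);
        }

        public static void main(String[] args) {
            String id = tagSessionId("A3F9C2D11B", "bucket17");
            System.out.println(id);           // A3F9C2D11B.bucket17
            System.out.println(bucketOf(id)); // bucket17 -> some host:port in workers.properties
        }
    }

Moving a bucket between nodes then only means re-pointing that worker's
host:port on the Apache side; the session ids the clients hold stay
valid - which is exactly why moving a session *between* buckets is the
hard case, as it would invalidate the id.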

