geronimo-dev mailing list archives

From Jules Gosnell <>
Subject [clustering] automatic partitioning, state bucketing, etc... (long)
Date Fri, 15 Aug 2003 07:39:02 GMT

I know this is a little early, but you don't have to read it :-)

I've been playing around with some ideas about [web] clustering and
distribution of session state that I would like to share...

My last iteration on this theme (Jetty's distributable session
manager) left me with an implementation in which every node in the
cluster held a backup copy of every other node's state.

This is fine for small clusters, but the larger the cluster and the
total state, the less well it scales, because every node needs more
memory and more bandwidth to hold, and keep up to date, these backup
copies.
The standard way around this is to 'partition' your
cluster, i.e. break it up into lots of subclusters, each small enough
not to exhibit this problem. This complicates load-balancing
(as the load-balancer needs to know which nodes belong to which
partitions i.e. carry which state). Furthermore, in my implementation,
the partitions needed to be configured statically, which is a burden
in terms of time and understanding on the administrator and means that
their layout will probably end up being suboptimal or, even worse,
broken. You also need more redundancy, because a static configuration
cannot react to runtime events (by, say, repartitioning if a node goes
down), so you would be tempted to put more nodes in each partition.

So, I am now thinking about dynamically 'auto-partitioning' clusters
with coordinated load-balancing.

The idea is to build a reliable underlying substrate of partitions
which reorganise dynamically as nodes leave and join the cluster.

I have something up and running which does this OK - basically nodes
are arranged in a circle. Each node and the
however-many-nodes-you-want-in-a-partition before it, form a
partition. Partitions therefore overlap around the clock. In fact,
every node will belong to the same number of partitions as there are
members in each one.
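The overlapping-window arrangement can be sketched in a few lines of
Python (an illustration of mine, not code from the actual
implementation; all the names here are made up):

```python
# Model the ring: each node, together with the n-1 nodes immediately
# before it on the ring, forms a partition. Partitions overlap "around
# the clock", so every node belongs to exactly n partitions.

def ring_partitions(nodes, n):
    """Return one partition per node: that node plus the n-1 nodes before it."""
    count = len(nodes)
    return [tuple(nodes[(i - j) % count] for j in range(n))
            for i in range(count)]

parts = ring_partitions(["A", "B", "C", "D", "E"], 3)
# With 5 nodes and n = 3 there are 5 partitions of 3 nodes each,
# and every node appears in exactly 3 of them:
membership = {node: sum(node in p for p in parts) for node in "ABCDE"}
```

The modulo arithmetic is what makes the windows wrap around the circle
rather than stopping at the ends of the list.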

As a node joins the cluster, the partitions are rearranged to
accommodate it whilst maintaining this strategy. It transpires that n-1
(where n is the number of nodes per partition) nodes directly behind
the new arrival have to pass ownership of an existing partition
membership to the new node and a new partition has to be created
encompassing these nodes and the new one. So for each new node a total
of n-1 partition releases and n partition acquisitions occurs.

As a node leaves the cluster, exactly the same happens in
reverse. This applies until the total number of nodes in the cluster
falls to n or below, at which point you can optimise the number of
partitions down to 1.

I'm happy (currently) with this arrangement as a robust
underlying platform for state replication. The question is whether the
'buckets' that state is held in should map directly to these
partitions (i.e. each partition maintains a single bucket), or whether
their granularity should be finer.

I'm still thinking about that at the moment, but thought I would throw
my thoughts so far into the ring and see how they go..

If you map 1 bucket:1 partition you will end up with a cluster that is
very unevenly balanced in terms of state. Every time a node joins,
although it will acquire n-1 'buddies' and copies of their state, you
won't be able to balance up the existing state across the cluster,
because to give the new node some state of its own you would need to
take it from another partition; the mapping is 1:1, so the smallest
unit of state that you can transfer from one partition to another is
its entire contents/bucket.

The other extreme is 1 bucket:1 session (assuming we are talking
session state here) - but then why have buckets.. ?

My next thought was that the number of buckets should be the square of
the number of nodes in the cluster. That way, every node carries n
buckets (where n is the total number of nodes in the cluster). When
you add a new node you can transfer a bucket from each existing node
to it. Each other node creates two empty buckets and the new node
creates one. State is nicely balanced etc... However, this means that
the number of buckets grows quadratically with the size of the
cluster. Which smacks of something unscalable to me and was exactly
the reason that I added partitions. In the web world, I shall probably
have to map a bucket to an e.g. mod_jk worker (this entails appending
the name of the bucket to the session-id and telling Apache/mod_jk to
associate a particular host:port with this name. I can alter the node
on which the bucket is held by informing Apache/mod_jk that this
mapping has changed. Moving a session from one bucket to another is
problematic and probably impossible, as I would have to change the
session-id that the client holds. Whilst I could do this if I waited
for a request from a single-threaded client, with a multi-threaded
client it becomes much more difficult...). If we had 25 nodes in the
cluster, this would mean 625 Apache/mod_jk workers... I'm not sure
that mod_jk would like this :-).
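The bucket arithmetic for the square scheme checks out in a little
sketch (illustrative code of mine, nothing to do with Jetty's
internals): starting from c nodes holding c buckets each, a join
transfers one bucket from every existing node and creates 2c+1 empty
ones, leaving (c+1)^2 buckets spread evenly over c+1 nodes.

```python
# A cluster is modelled as a list of nodes; each node is a list of
# buckets; each bucket is a list of session ids.

def join(cluster):
    """Apply the 'square' rebalancing as one new, empty node joins."""
    newcomer = []
    for node in cluster:
        newcomer.append(node.pop())   # transfer one full bucket to the new node
        node.extend([[], []])         # each old node creates two empty buckets
    newcomer.append([])               # the new node creates one empty bucket
    cluster.append(newcomer)
    return cluster

cluster = [[["s1"], ["s2"]], [["s3"], ["s4"]]]   # 2 nodes, 4 buckets
cluster = join(cluster)                          # 3 nodes, 9 buckets, 3 each
```

The invariant (every node holds exactly as many buckets as there are
nodes) survives the join, which is what makes the scheme self-balancing.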

So, I am currently thinking about either having a fixed number of
buckets per node, so that the total increases linearly with the size
of the cluster, or keeping the 1:1 mapping, living with the inability
to immediately balance the cluster, and instead dynamically altering
the node weightings in the load-balancer (mod_jk) so that the cluster
soon evens out again...
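One way that re-weighting could work (an assumption on my part; the
formula below is made up for illustration, not taken from mod_jk or any
implementation) is to give each node a weight inversely related to the
amount of state it already holds, so an empty newcomer attracts
proportionally more new sessions until things even out:

```python
# Illustrative only: derive load-balancer weights from per-node
# session counts. A node holding less state gets a higher weight,
# steering new sessions towards it.

def balance_weights(session_counts):
    heaviest = max(session_counts.values())
    return {node: heaviest - held + 1
            for node, held in session_counts.items()}

# node3 has just joined and holds no state of its own yet:
weights = balance_weights({"node1": 50, "node2": 50, "node3": 0})
```

As node3 accumulates sessions its weight shrinks back towards the
others', so the skew corrects itself without moving any existing state.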

I guess that's enough for now....

If this is an area that interests you, please give me your thoughts
and we can massage them all together :-)


 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
