geronimo-dev mailing list archives

From Jules Gosnell <>
Subject Re: [clustering] automatic partitioning, state bucketing, etc... (long)
Date Fri, 15 Aug 2003 08:24:25 GMT
I've just talked myself into the fixed number of buckets per partition.

The whole reason behind partitioning and horizontal scaling is to keep 
the load/state on any one node constant. Therefore the number of 
buckets per partition and sessions per bucket must be constant too, so 
the totals of both will increase linearly with the number of nodes in 
the cluster...


Jules Gosnell wrote:

> I know this is a little early, but, you don't have to read it :-)
>
> I've been playing around with some ideas about [web] clustering and
> distribution of session state that I would like to share...
>
> My last iteration on this theme (Jetty's distributable session
> manager) left me with an implementation in which every node in the
> cluster held a backup copy of every other node's state.
>
> This is fine for small clusters, but the larger the cluster and its
> total state, the less it scales, because every node needs ever more
> memory and bandwidth to hold these backups and keep them up to date -
> with n nodes, the cluster holds n copies of everything.
> The standard way around this is to 'partition' your cluster,
> i.e. break it up into lots of subclusters, each small enough not to
> display this problem. This complicates load-balancing (as the
> load-balancer needs to know which nodes belong to which partitions,
> i.e. which carry which state). Furthermore, in my implementation, the
> partitions needed to be configured statically, which is a burden in
> terms of time and understanding on the administrator and means that
> their layout will probably end up being suboptimal or, even worse,
> broken. You also need more redundancy, because a static configuration
> cannot react to runtime events by, say, repartitioning if a node goes
> down, so you would be tempted to put more nodes in each partition,
> etc...
>
> So, I am now thinking about dynamically 'auto-partitioning' clusters
> with coordinated load-balancing.
> The idea is to build a reliable underlying substrate of partitions
> which reorganise dynamically as nodes leave and join the cluster.
>
> I have something up and running which does this OK - basically, nodes
> are arranged in a circle. Each node, together with the
> however-many-nodes-you-want-in-a-partition before it, forms a
> partition. Partitions therefore overlap around the clock. In fact,
> every node will belong to the same number of partitions as there are
> members in each one.
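>
> A toy sketch of that arrangement (the Ring class and method names are
> mine, purely illustrative):
>
>     import java.util.*;
>
>     class Ring {
>         // Each node forms a partition with the n-1 nodes immediately
>         // before it on the ring, so partitions overlap 'around the
>         // clock' and every node belongs to exactly n partitions.
>         // (No special-casing of clusters smaller than n here.)
>         static List<List<String>> partitions(List<String> nodes, int n) {
>             List<List<String>> result = new ArrayList<>();
>             int size = nodes.size();
>             for (int i = 0; i < size; i++) {
>                 List<String> p = new ArrayList<>();
>                 for (int j = 0; j < n; j++) {
>                     p.add(nodes.get(((i - j) % size + size) % size));
>                 }
>                 result.add(p);
>             }
>             return result;
>         }
>
>         public static void main(String[] args) {
>             // 5 nodes, 3 per partition -> 5 overlapping partitions
>             Ring.partitions(List.of("A", "B", "C", "D", "E"), 3)
>                 .forEach(System.out::println);
>         }
>     }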
> As a node joins the cluster, the partitions are rearranged to
> accommodate it whilst maintaining this strategy. It transpires that
> the n-1 nodes (where n is the number of nodes per partition) directly
> behind the new arrival each have to pass ownership of an existing
> partition membership to the new node, and a new partition has to be
> created encompassing those nodes and the new one. So each new node
> causes a total of n-1 partition releases and n partition acquisitions.
>
> As a node leaves the cluster, exactly the same happens in reverse.
> This applies until the total number of nodes in the cluster falls to
> n or below, at which point you can optimise the number of partitions
> down to 1.
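>
> Reusing the Ring sketch above, the handover counts can be checked by
> printing the partitions before and after a join (again, just my toy
> illustration):
>
>     class JoinDemo {
>         public static void main(String[] args) {
>             var five = java.util.List.of("A", "B", "C", "D", "E");
>             var six = java.util.List.of("A", "B", "C", "D", "E", "F");
>             System.out.println("before: " + Ring.partitions(five, 3));
>             System.out.println("after:  " + Ring.partitions(six, 3));
>             // With n=3, F slots in directly behind A: D gives up its
>             // place in A's partition and E its place in B's partition
>             // (n-1 = 2 releases), while F joins those two partitions
>             // plus the newly created {F,E,D} (n = 3 acquisitions).
>         }
>     }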
> I'm happy (currently) with this arrangement as a robust underlying
> platform for state replication. The question is whether the 'buckets'
> that state is held in should map directly to these partitions
> (i.e. each partition maintains a single bucket), or whether their
> granularity should be finer.
>
> I'm still thinking about that at the moment, but thought I would throw
> my thoughts so far into the ring and see how they go...
> If you map 1 bucket : 1 partition, you will end up with a cluster
> that is very unevenly balanced in terms of state. Every time a node
> joins, although it will acquire n-1 'buddies' and copies of their
> state, you won't be able to balance up the existing state across the
> cluster, because to give the new node some state of its own you would
> need to take it from another partition - and since the mapping is
> 1:1, the smallest unit of state that you can transfer from one
> partition to another is its entire contents/bucket.
>
> The other extreme is 1 bucket : 1 session (assuming we are talking
> about session state here) - but then why have buckets at all?
> My next thought was that the number of buckets should be the square
> of the number of nodes in the cluster. That way, every node carries n
> buckets (where n is the total number of nodes in the cluster). When
> you add a new node, you can transfer a bucket to it from each
> existing node. Each existing node creates two empty buckets and the
> new node creates one. State stays nicely balanced, etc... However,
> this means that the number of buckets grows quadratically, which
> smacks of something unscalable to me and was exactly the reason that
> I added partitions in the first place.
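>
> To see why that worried me, the totals (illustrative numbers only):
>
>     class SquareBuckets {
>         public static void main(String[] args) {
>             // buckets = nodes^2; each node carries `nodes` buckets
>             for (int nodes : new int[] {5, 10, 25}) {
>                 System.out.println(nodes + " nodes -> "
>                         + nodes * nodes + " buckets ("
>                         + nodes + " per node)");
>             }
>             // On each join: every old node hands one bucket to the
>             // newcomer and creates two empty ones; the newcomer
>             // creates one. (n+1)^2 - n^2 = 2n + 1 new buckets.
>         }
>     }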
> an e.g. mod_jk worker. (this entails appending the name of the bucket
> to the session-id and telling Apache/mod_jk to associate a particular
> host:port with this name. I can alter the node on which the bucket is
> held, by informing Apache/mod_jk that this mapping has changed. Moving
> a session from one bucket to another is problematic and probably
> impossible as I would have to change the session-id that the client
> holds. Whilst I could do this if I waited for a request from a single
> threaded client, with a multi-threaded client it becomes much more
> difficult...). If we had 25 nodes in the cluster, this would mean 625
> Apache/mod_jk workers... I'm not sure that mod_jk would like this :-).
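>
> For concreteness, the session-id tagging I have in mind looks roughly
> like this (the helper names are mine; the routing key is the suffix
> after the last '.' in the session-id, as with Tomcat's jvmRoute):
>
>     class BucketRouting {
>         // append the bucket name so Apache/mod_jk can pick the
>         // worker currently mapped to that bucket
>         static String tag(String sessionId, String bucketName) {
>             return sessionId + "." + bucketName;
>         }
>
>         // what the load-balancer effectively does on each request
>         static String bucketOf(String taggedId) {
>             return taggedId.substring(taggedId.lastIndexOf('.') + 1);
>         }
>
>         public static void main(String[] args) {
>             String id = tag("0A1B2C3D", "bucket17");
>             System.out.println(id + " -> " + bucketOf(id));
>         }
>     }
>
> Moving a whole bucket between nodes then only means re-pointing the
> worker's host:port - no session-id ever has to change.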
> So, I am currently thinking about either a fixed number of buckets
> per node, so that the total number of buckets increases linearly with
> the size of the cluster, or a 1:1 bucket:partition mapping, living
> with the inability to immediately balance the cluster and instead
> dynamically altering the node weightings in the load-balancer
> (mod_jk) so that the cluster soon evens out again...
>
> I guess that's enough for now....
>
> If this is an area that interests you, please give me your thoughts
> and we can massage them all together :-)
>
> Jules

Jules Gosnell
Partner
Core Developers Network (Europe)
