geronimo-dev mailing list archives

From Jules Gosnell <ju...@coredevelopers.net>
Subject Re: [clustering] automatic partitioning, state bucketing, etc... (long)
Date Fri, 15 Aug 2003 09:16:55 GMT
Robert Stupp wrote:

> Hi!
>
> Heavy stuff ;)
>
> Your thoughts are great.
>
> It is not too early to think about load-balancing and fail-over. The
> whole application server infrastructure has to be at least prepared for
> it - both a concept and a "raw" API should be considered at this early
> stage.


I agree. Geronimo's design should take clustering into account from the
beginning, so that it does not end up as a clumsy bolt-on.

>
>
> Using "real" fail-over requires other components (e.g. database
> repositories) to be fail-safe, too (Oracle clusters for example).

you could spend a lot of money :-)

Jules

>
> Jules Gosnell wrote:
> |
> | I know this is a little early, but you don't have to read it :-)
> |
> |
> | I've been playing around with some ideas about [web] clustering and
> | distribution of session state that I would like to share...
> |
> |
> | My last iteration on this theme (Jetty's distributable session
> | manager) left me with an implementation in which every node in the
> | cluster held a backup copy of every other node's state.
> |
> | This is fine for small clusters, but the larger the cluster and its
> | total state, the less well it scales, because every node needs more
> | memory and more bandwidth to hold these backups and keep them up to
> | date.
> |
> | The standard way around this is to 'partition' your cluster,
> | i.e. break it up into lots of subclusters, each small enough not to
> | display this problem. This complicates load-balancing (as the
> | load-balancer needs to know which nodes belong to which partitions,
> | i.e. carry which state). Furthermore, in my implementation the
> | partitions needed to be configured statically, which is a burden in
> | terms of time and understanding on the administrator and means that
> | their layout will probably end up being suboptimal or, even worse,
> | broken. You also need more redundancy, because a static configuration
> | cannot react to runtime events (say, by repartitioning if a node goes
> | down), so you would be tempted to put more nodes in each partition
> | etc...
> |
> | So, I am now thinking about dynamically 'auto-partitioning' clusters
> | with coordinated load-balancing.
> |
> | The idea is to build a reliable underlying substrate of partitions
> | which reorganise dynamically as nodes leave and join the cluster.
> |
> | I have something up and running which does this OK - basically, nodes
> | are arranged in a circle. Each node, together with the n-1 nodes
> | directly before it (where n is however many nodes you want in a
> | partition), forms a partition. Partitions therefore overlap around the
> | clock. In fact, every node will belong to the same number of
> | partitions as there are members in each one.
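> |
> | A minimal sketch of this ring layout (purely illustrative - the node
> | names and the code are made up, not the real implementation):
> |
> |   public class RingPartitions {
> |
> |       // members of the partition owned by node 'owner', walking
> |       // backwards around the clock and wrapping at zero
> |       static String[] partition(String[] ring, int owner, int n) {
> |           String[] members = new String[n];
> |           for (int i = 0; i < n; i++) {
> |               int idx = ((owner - i) % ring.length + ring.length) % ring.length;
> |               members[i] = ring[idx];
> |           }
> |           return members;
> |       }
> |
> |       public static void main(String[] args) {
> |           String[] ring = {"A", "B", "C", "D", "E"};   // 5 nodes, clockwise
> |           int n = 3;                                   // nodes per partition
> |           for (int p = 0; p < ring.length; p++) {
> |               System.out.println("partition " + p + " = "
> |                   + java.util.Arrays.asList(partition(ring, p, n)));
> |           }
> |           // each node ends up in exactly n partitions, and each
> |           // partition has exactly n members
> |       }
> |   }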
> |
> | As a node joins the cluster, the partitions are rearranged to
> | accommodate it whilst maintaining this strategy. It transpires that the
> | n-1 (where n is the number of nodes per partition) nodes directly
> | behind the new arrival have to pass ownership of an existing partition
> | membership to the new node, and a new partition has to be created
> | encompassing these nodes and the new one. So for each new node a total
> | of n-1 partition releases and n partition acquisitions occur.
> |
> | As a node leaves the cluster, exactly the same happens in
> | reverse. This applies until the total number of nodes in the cluster
> | falls to n or below, at which point you can optimise the number of
> | partitions down to 1.
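> |
> | Worked through on the illustrative 5-node ring above (n = 3), the
> | partitions are {A,E,D}, {B,A,E}, {C,B,A}, {D,C,B} and {E,D,C}. If a
> | new node F joins between E and A, they become {A,F,E}, {B,A,F},
> | {C,B,A}, {D,C,B}, {E,D,C} plus the new {F,E,D}: D and E (the n-1
> | nodes directly behind F) each release one membership, and F acquires
> | n memberships - the two handed over plus the newly created partition.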
> |
> | I'm happy (currently) with this arrangement as a robust underlying
> | platform for state replication. The question is whether the
> | 'buckets' that state is held in should map directly to these
> | partitions (i.e. each partition maintains a single bucket), or whether
> | their granularity should be finer.
> |
> | I'm still thinking about that at the moment, but thought I would throw
> | my thoughts so far into the ring and see how they go..
> |
> | If you map 1 bucket:1 partition you will end up with a cluster that is
> | very unevenly balanced in terms of state. Every time a node joins,
> | although it will acquire n-1 'buddies' and copies of their state, you
> | won't be able to balance up the existing state across the cluster,
> | because to give the new node some state of its own you would need to
> | take it from another partition. The mapping is 1:1, so the smallest
> | unit of state that you can transfer from one partition to another is
> | its entire contents/bucket.
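> |
> | As a hypothetical illustration: three nodes, three partitions, one
> | bucket of roughly 1000 sessions each. A fourth node joins, a new
> | partition is formed and it starts with an empty bucket; the only way
> | to give the newcomer a share of the existing state is to move one of
> | the 1000-session buckets to it wholesale, which just relocates the
> | imbalance rather than evening it out.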
> |
> | The other extreme is 1 bucket:1 session (assuming we are talking about
> | session state here) - but then why have buckets at all?
> |
> | My next thought was that the number of buckets should be the square of
> | the number of nodes in the cluster. That way, every node carries n
> | buckets (where n is the total number of nodes in the cluster). When
> | you add a new node you can transfer one bucket from each existing node
> | to it. Each existing node creates two empty buckets and the new node
> | creates one. State is nicely balanced etc... However, this means that
> | the number of buckets grows with the square of the cluster size, which
> | smacks of something unscalable to me and was exactly the reason that I
> | added partitions. In the web world, I shall probably have to map a
> | bucket to, e.g., a mod_jk worker. (This entails appending the name of
> | the bucket to the session-id and telling Apache/mod_jk to associate a
> | particular host:port with this name. I can alter the node on which the
> | bucket is held by informing Apache/mod_jk that this mapping has
> | changed. Moving a session from one bucket to another is problematic
> | and probably impossible, as I would have to change the session-id that
> | the client holds. Whilst I could do this if I waited for a request
> | from a single-threaded client, with a multi-threaded client it becomes
> | much more difficult...) If we had 25 nodes in the cluster, this would
> | mean 625 Apache/mod_jk workers... I'm not sure that mod_jk would like
> | this :-).
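> |
> | To make that growth concrete (total buckets = nodes squared, one
> | mod_jk worker per bucket):
> |
> |   nodes   buckets per node   total buckets / workers
> |       5                  5                        25
> |      10                 10                       100
> |      25                 25                       625
> |     100                100                    10,000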
> |
> | So, I am currently thinking about either a fixed number of buckets per
> | node (so that the total number of buckets only grows linearly with the
> | size of the cluster), or a 1:1 mapping, living with the inability to
> | immediately rebalance existing state and instead dynamically altering
> | the node weightings in the load-balancer (mod_jk) so that the cluster
> | soon evens out again...
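> |
> | For the weighting idea, something along these lines (again purely a
> | sketch - the session counts, the 1-100 weight range and whatever
> | mechanism pushes the numbers to mod_jk are all assumptions, not
> | mod_jk's actual API):
> |
> |   public class Weights {
> |
> |       // emptier nodes get a weight closer to maxWeight, so new
> |       // sessions flow to them and the cluster levels out over time
> |       static int[] weights(int[] sessionCounts, int maxWeight) {
> |           int heaviest = 1;
> |           for (int i = 0; i < sessionCounts.length; i++) {
> |               heaviest = Math.max(heaviest, sessionCounts[i]);
> |           }
> |           int[] w = new int[sessionCounts.length];
> |           for (int i = 0; i < sessionCounts.length; i++) {
> |               w[i] = Math.max(1,
> |                   maxWeight - (sessionCounts[i] * (maxWeight - 1)) / heaviest);
> |           }
> |           return w;
> |       }
> |
> |       public static void main(String[] args) {
> |           int[] counts = {900, 850, 920, 0};   // node D has just joined, empty
> |           int[] w = weights(counts, 100);
> |           for (int i = 0; i < w.length; i++) {
> |               System.out.println("node " + (char) ('A' + i)
> |                   + ": sessions=" + counts[i] + " weight=" + w[i]);
> |           }
> |       }
> |   }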
> |
> | I guess that's enough for now....
> |
> | If this is an area that interests you, please give me your thoughts
> | and we can massage them all together :-)
> |
> |
> | Jules
> |


-- 
/*************************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * http://www.coredevelopers.net
 *************************************/


