geronimo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jules Gosnell <>
Subject Re: Clustering (long)
Date Wed, 27 Jul 2005 12:37:07 GMT
Andy Piper wrote:

> Hi Jules
> It sounds like you've been working hard!

yes - I need a break :-)

> I think you might find you run into reliability issues with a 
> singleton coordinator. This is one of those well known Hard Problems 
> and for session replication its not really necessary. In essence the 
> coordinator is a single point of failure and making it reliable is 
> provably non trivial.

I agree on the SPoF thing - but I think you misunderstand my Coordinator 
arch. I do not have a single static Coordinator node, but a dynamic 
Coordinator role, into which a node may be elected. Thus every node is a 
potential Coordinator. If the elected Coordinator dies, another is 
immediately elected. The election strategy is pluggable, although it 
will probably end up being hardwired to "oldest-cluster-member". The 
reason behind this is that relaying out your cluster is much simpler if 
it is done in a single vm. I originally tried to do it in multiple vms, 
each taking responsibility for pieces of the cluster, but if the vms 
views are not completely in sync, things get very hairy, and completely 
in sync is an expensive thing to achieve - and would introduce a 
cluster-wide single point of contention. So I do it in a single vm, as 
fast as I can, with fail over, in case that vm evaporates. Does that 
sound better than the scenario that you had in mind ?

The Coordinator is not there to support session replication, but rather 
the management of the distributed map (map of which a few buckets live 
on each node) which is used by WADI to discover very efficiently whether 
a session exists and where it is located. This map must be rearranged, 
in the most efficient way possible, each time a node joins or leaves the 

Replication is NYI - but I'm running a few mental background threads 
that suggest that an extension to the index will mean that it associates 
the session's id not just to its current location, but also to the 
location of a number of replicants. I also have ideas on how a session 
might choose nodes into which it will place its replicants and how I can 
avoid the primary session copy ever being colocated with a replicant 
(potential SPoF - if you only have one replicant), etc...

> Cluster re-balancing is also a hairy issue in that its easy to run 
> into cascading failures if you pro-actively move sessions when a 
> server leaves the cluster.

Yes, I can see that happening - I have an improvement (NYI) to WADI's 
evacuation strategy (how sessions are evacuated when a node wishes to 
leave). Each session will be evacuated to the node which owns the bucket 
into which its id hashes. This is because colocation of the session with 
the bucket allows many messages concered with its future destruction and 
relocation to be optimised away. Future requests falling elsewhere but 
needing this session should, in the most efficient case, be relocated to 
this same node, other wise the session may be relocated, but at a cost...

> I'm happy to talk more about these issues off-line if you want.

I would be very grateful in any thoughts or feedback that you could give 
me. I hope to get much more information about WADI into the wiki over 
the next few weeks. That should help generate more discussion, although 
I would be more than happy for people to ask me questions here on 
Geronimo-dev because this will give me an idea of what documentation I 
should write and how existing documentation may be lacking or misleading.

Please fire away,


> Thanks
> andy
> At 02:31 PM 6/30/2005, Jules Gosnell wrote:
>> Guys,
>> I thought that it was high time that I brought you up to date with my
>> efforts in building a clustering layer for Geronimo.
>> The project,, started as an effort to build a
>> scalable clustered HttpSession implementation, but in doing this, I
>> have built components that should be useful in clustering the state
>> held in any tier of Geronimo e.g. OpenEJB SFSBs etc.
>> WADI (Web Application Distribution Infrastructure) has two main
>> elements - the vertical and the horizontal.
>> Vertically, WADI comprises a stack of pluggable stores. Each store has
>> a pluggable Evicter responsible for demoting aging Sessions
>> downwards. Requests arriving at the container are fed into the top of
>> the stack and progress downwards, until their corresponding Session is
>> found and promoted to the top, where the request is correctly rendered
>> in its presence.
>> Typically the top-level store is in Memory. Aging Sessions are demoted
>> downwards onto exclusively owned LocalDisc. The bottom-most store is a
>> database shared between all nodes in the Cluster. The first node
>> joining the Cluster promotes all Sessions from the database into
>> exclusively-owned store - e.g. LocalDisc. The last node to leave the
>> Cluster demotes all Sessions down back into the database.
>> Horizontally, all nodes in a WADI Cluster are connected (p2p) via a
>> Clustered Store component within this stack. This typically sits at
>> the boundary between exclusive and shared Stores. As requests fall
>> through the stack, looking for their corresponding Session they arrive
>> at the Clustered store, where, if the Session is present anywhere in
>> the Cluster, its location may be learnt. At this point, the Session
>> may be migrated in, underneath the incoming request, or, if its
>> current location is considered advantageous, the request may be
>> proxied or redirected to its remote location. As a node leaves the
>> Cluster, all its Sessions are evacuated to other nodes via this store,
>> so that they may continue to be actively maintained.
>> The space in which Session ids are allocated is divided into a fixed
>> number of Buckets. This number should be large enough such that
>> management of the Buckets may be divided between all nodes in the
>> Cluster roughly evenly. As nodes leave and join the Cluster, a single
>> node, the Coordinator, is responsible for re-Bucketing the Cluster -
>> i.e. reorganising who manages which Buckets and ensuring the safe
>> transfer of the minimum number of Buckets to implement the new
>> layout. The Coordinator is elected via a Pluggable policy. If the
>> Coordinator leaves or fails, a new one is elected. If a node leaves or
>> joins, buckets emigrate from it or immigrate into it, under the
>> control of the Coordinator, to/from the rest of the Cluster.
>> A Session may be efficiently mapped to a Bucket by simply %-ing its
>> ID's hashcode() by the number of Buckets in the Cluster.
>> A Bucket is a map of SessionID:Location, kept up to date with the
>> Location of every Session in the Cluster, of which the id falls into
>> its space. i.e. as Sessions are created, destroyed or moved around the
>> Cluster notifications are sent to the node managing the relevant
>> Bucket, informing it of the change.
>> In this way, if a node receives a request for a Session which it does
>> not own locally, it may pass a message to it, in a maximum of
>> typically two hops, by sending the message to the Bucket owner, who
>> then does a local lookup of the Sessions actual location and forwards
>> the message to it. If Session and Bucket can be colocated, this can
>> reduced to a single hop.
>> Thus, WADI provides a fixed and scalable substrate over the more fluid
>> arrangement that Cluster membership comprises, on top of which further
>> Clustered services may be built.
>> The above functionality exists in WADI CVS and I am currently working
>> on hardening it to the point that I would consider it production
>> strength. I will then consider the addition of some form of state
>> replication, so that, even with the catastrophic failure of a member
>> node, no state is lost from the Cluster.
>> I plan to begin integrating WADI with Geronimo as soon as a certified
>> 1.0-based release starts to settle down. Certification is the most
>> immediate goal and clustering is not really part of the spec, so I
>> think it best to satisfy the first before starting on subsequent
>> goals.
>> Further down the road we need to consider the unification of session
>> id spaces used by e.g. the web and ejb tiers and introduction of an
>> ApplicationSession abstraction - an object which encapsulates all
>> e.g. web and ejb sessions associated with a particular client talking
>> to a particular application. This will allow WADI to maintain the
>> colocation of associated state, whilst moving and replicating it
>> around the Cluster.
>> If anyone would like to know more about WADI, please feel free to ask
>> me questions here on geronimo-dev or on wadi-dev.
>> Thanks for listening,
>> Jules

"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * Open Source Training & Support.

View raw message