jackrabbit-dev mailing list archives

From Ian Boston <...@tfd.co.uk>
Subject Re: [jr3] Clustering: Scalable Writes / Asynchronous Change Merging
Date Fri, 22 Oct 2010 11:01:36 GMT

On 19 Oct 2010, at 11:24, Thomas Müller wrote:

> changeSetId = nanosecondsSince1970 * totalClusterNodes + clusterNodeId

I have spent some time doing experiments on this; here are some observations based on those
experiments. (They could be totally irrelevant to this discussion; sorry for the noise if so.)

A sequential number helps since you know the order and can maintain lookahead and lookback
buffers to ensure you replay in sequence.
Performing master election, and then working out the mean time across all nodes in the cluster,
with each offset estimated by averaging the request/response travel time, gives the offset of
all nodes relative to one another, allowing real-time clock sync with diffs. You do need more
than one message in the master election to get a good measurement of the time offset (ideally,
measure until the mean is stable).
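
A minimal sketch of the offset measurement, in Java. The class and method names are hypothetical,
and it assumes the remote node stamps its reply at roughly the midpoint of the round trip (i.e.
the travel time is symmetric). It keeps a running mean and reports when a new sample moves the
mean by no more than a tolerance:

```java
/** Hypothetical helper: estimates the clock offset of one remote node. */
class ClockOffsetEstimator {
    private int samples = 0;
    private double mean = 0.0;
    private double previousMean = 0.0;

    /** Record one request/response exchange; all times are in milliseconds. */
    void addSample(long localSend, long remoteTime, long localReceive) {
        // Assumption: the remote timestamp was taken halfway through the round trip.
        double midpoint = (localSend + localReceive) / 2.0;
        double offset = remoteTime - midpoint;
        samples++;
        previousMean = mean;
        mean += (offset - mean) / samples; // incremental running mean
    }

    /** Estimated offset (remote clock minus local clock), in milliseconds. */
    double meanOffset() {
        return mean;
    }

    /** True once enough samples exist and the last one moved the mean by at most toleranceMs. */
    boolean isStable(int minSamples, double toleranceMs) {
        return samples >= minSamples && Math.abs(mean - previousMean) <= toleranceMs;
    }
}
```

A caller would keep exchanging messages with each node until isStable(...) returns true (with
minSamples of at least 2), then use meanOffset() as that node's diff.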

Once you have that, I would recommend BigInt storage and reserve 4 digits for the cluster
I would also not use nanoseconds, but make the ID generator in each JVM synchronised and start
a rapid counter reserving 10K ticks per ms. Server offsets are never going to be ns accurate,
and who knows which conflicting change is right in the ns range. So the structure of the change
set ID is:
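
A minimal sketch of such a generator, in Java. The digit layout used here (time in ms, then a
tick counter with 10K slots per ms, then 4 decimal digits for the cluster node ID) is an
assumption for illustration only, as are the class name and the injected clock, which is assumed
to be already corrected by the node's measured offset:

```java
import java.math.BigInteger;
import java.util.function.LongSupplier;

/** Hypothetical generator; assumed layout: (millis * 10K + tick) * 10K + clusterNodeId. */
class ChangeSetIdGenerator {
    static final int TICKS_PER_MS = 10_000; // 10K ticks reserved per millisecond
    static final int NODE_SLOTS = 10_000;   // 4 decimal digits for the cluster node ID

    private final int clusterNodeId;
    private final LongSupplier clockMillis; // assumed to return offset-corrected time in ms
    private long lastMillis = -1;
    private int tick = 0;

    ChangeSetIdGenerator(int clusterNodeId, LongSupplier clockMillis) {
        if (clusterNodeId < 0 || clusterNodeId >= NODE_SLOTS) {
            throw new IllegalArgumentException("clusterNodeId must fit in 4 digits");
        }
        this.clusterNodeId = clusterNodeId;
        this.clockMillis = clockMillis;
    }

    /** The only shared state is guarded by this small synchronized block. */
    synchronized BigInteger nextId() {
        long now = clockMillis.getAsLong();
        if (now > lastMillis) {
            lastMillis = now;
            tick = 0;
        } else if (++tick >= TICKS_PER_MS) {
            // Ticks exhausted (or the clock stepped back): move on to the next millisecond.
            lastMillis++;
            tick = 0;
        }
        return BigInteger.valueOf(lastMillis)
                .multiply(BigInteger.valueOf(TICKS_PER_MS))
                .add(BigInteger.valueOf(tick))
                .multiply(BigInteger.valueOf(NODE_SLOTS))
                .add(BigInteger.valueOf(clusterNodeId));
    }
}
```

With this layout, IDs from one node are strictly increasing, and IDs from different nodes never
collide because the low 4 digits differ.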


This generates a cluster-wide sequence where each counter runs independently on each server,
blocked only by a very small sync block. The sequence has holes in it. Interestingly, if you
can accept that the sequence is reliable (i.e. after a certain time delay all events have arrived
and none were lost), it can be used as the basis for the sequence in the current (2x) ClusterNode
implementation, as the replay of a Journal does not require a hole-free sequence. (AFAICT)
 The sequence can be checked for lost events if each subsequent event contains the previous
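
A minimal sketch of that check, in Java. It assumes each event carries the ID of the previous
event issued by the same node, and that events from a single node arrive in order; both are
assumptions, and the class name is hypothetical:

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical detector: flags a gap when an event's previous ID was never seen. */
class LostEventDetector {
    private final Map<Integer, BigInteger> lastSeenByNode = new HashMap<>();

    /**
     * Returns true if the chain for this node is intact, false if an event was missed.
     * The first event seen from a node is accepted, since there is nothing to compare with.
     */
    boolean accept(int nodeId, BigInteger id, BigInteger previousId) {
        BigInteger lastSeen = lastSeenByNode.put(nodeId, id);
        return lastSeen == null || lastSeen.equals(previousId);
    }
}
```

Holes in the ID sequence do not matter here: only the link from each event to its predecessor
is compared.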

The clusterNodeID is taken from a position in a List of clusterNodeNames that is shared on
master election. The only thing the master does is co-ordinate that list and determine clock
offsets. That also removes any management of the range, unlike Cassandra, which shards the
ID range over the cluster from the start.

The other thing I have found necessary is to propagate version information of changes, so
that when a remote node receives an event and doesn't have the current version locally, it
knows which node within the cluster does, giving just-in-time-consistent rather
than eventually-consistent.
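
A minimal sketch of that lookup, in Java. All names are hypothetical, and it assumes each event
carries an item ID, a version number and the ID of the node that holds that version:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical event: says which node holds which version of an item. */
class ChangeEvent {
    final String itemId;
    final long version;
    final int sourceNodeId;

    ChangeEvent(String itemId, long version, int sourceNodeId) {
        this.itemId = itemId;
        this.version = version;
        this.sourceNodeId = sourceNodeId;
    }
}

/** Hypothetical tracker of the versions held on the local node. */
class VersionTracker {
    private final Map<String, Long> localVersions = new HashMap<>();

    /** Record that the local node now holds at least this version of the item. */
    void storeLocally(String itemId, long version) {
        localVersions.merge(itemId, version, Long::max);
    }

    /** Empty if the local copy is current; otherwise the node to fetch the item from. */
    OptionalInt nodeToFetchFrom(ChangeEvent event) {
        long local = localVersions.getOrDefault(event.itemId, -1L);
        return local >= event.version
                ? OptionalInt.empty()
                : OptionalInt.of(event.sourceNodeId);
    }
}
```

The fetch itself can then be deferred until the item is actually read on the receiving node.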

I am sorry if this is old information; I haven't been following the JR3 discussions too closely,
but I hope it helps. One other thing you might want to have a quick look at is the Quorum
implementation in Cassandra, which might be relevant to making certain areas more ACID than

What you do with the data once you have the sequence and event distribution is another matter,
I guess you are thinking of distributing the content over the entire cluster rather than making
it local to each node.

