Mailing-List: contact dev-help@directory.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Apache Directory Developers List" <dev@directory.apache.org>
Message-ID: <497274A5.9070200@nextury.com>
Date: Sun, 18 Jan 2009 01:15:33 +0100
From: Emmanuel Lecharny <elecharny@gmail.com>
User-Agent: Thunderbird 2.0.0.19 (X11/20090105)
MIME-Version: 1.0
To: Apache Directory Developers List <dev@directory.apache.org>
Subject: [Mitosis] random thoughts ...
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit

After discussing replication with Alex yesterday, I thought a bit about
what a replication system means, and I came to the conclusion that we
should not consider replication from a server-to-server perspective,
but as a whole. OK, I know, it's a bit fuzzy. Let me explain what I have
in mind.

First, let's consider that all the servers are connected and replicate
correctly, without any kind of problem (i.e., they never get
disconnected, they are all time-synchronized, and every operation has
its own unique timestamp). In this ideal case, the full set of LDAP
servers can be seen as a single LDAP server : everything is available
from any server, without any difference.

If at least one server gets disconnected, then you have split this
virtual big LDAP server in two parts : the disconnected server, and the
rest of them. As the remaining servers are still all connected and
perfectly synchronized, they still behave like one giant LDAP server, so
we are just facing two LDAP servers, disconnected from each other.

Moving a bit forward, if M servers get disconnected from a group of N
servers, then we fall back to the same situation : the M servers are
seen as a single LDAP server (as long as they remain connected to each
other), and so is the rest of the N servers.

One step further, if the set of servers is fragmented into many small
disconnected sets, then each of those sets is seen as a single LDAP
server.
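
To make the idea of "sets seen as a single server" a bit more concrete,
here is a minimal sketch in Python. The server names, the list of live
links and the connected_groups() helper are all made up for the
illustration, nothing here comes from the Mitosis code :

```python
from collections import defaultdict

def connected_groups(servers, links):
    """Return the sets of servers that can still reach each other."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    groups, seen = [], set()
    for start in servers:
        if start in seen:
            continue
        # Everything reachable from this server forms one "virtual" server.
        group, todo = set(), [start]
        while todo:
            s = todo.pop()
            if s in group:
                continue
            group.add(s)
            todo.extend(neighbours[s] - group)
        seen |= group
        groups.append(group)
    return groups

# Hypothetical topology: the link between s3 and s4 is broken.
servers = ["s1", "s2", "s3", "s4", "s5"]
links = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
groups = connected_groups(servers, links)
```

With this topology we get two groups, {s1, s2, s3} and {s4, s5}, and
each of them would be treated as one LDAP server in what follows.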

OK, so far so good. Where does this bring us ? I think that replication
per se is just a matter of managing replication between 2 servers; any
other case can fall back to this one.

Now, how do we manage replication between server A and server B
(whatever the number of real servers present in A and B) ? Simple : as
each operation within A or B is done on a globally connected system,
with each operation having its unique timestamp (i.e., two operations
never share a timestamp), all the modifications done on each side are
ordered. It's then just a matter of merging the two ordered lists of
operations from A and B, and applying them from the oldest operation to
the newest one. Let's see an example :

Server A and server B were synchronized at t0, when the connection was
broken. Since then, many modification operations occurred on both
servers :

on Server A : op[1, t1], ..., op[i-1, ti-1], op[i, ti], op[i+1, ti+1],
              ..., op[n-1, tn-1], op[n, tn]
on Server B : op[1, t'1], ..., op[j-1, t'j-1], op[j, t'j],
              op[j+1, t'j+1], ..., op[m-1, t'm-1], op[m, t'm]

Server A and server B are now connected back to each other. Each
modification done on B is to be applied on A, and each operation done
on A must be applied on B. What if some of those operations are
conflicting ? Let's just go back to t0, when both servers were
synchronized. If the servers had remained synchronized throughout the
connection breakage, then A and B would have received the modifications
from each other at the very moment they occurred, and each conflict
would have resulted in an error being sent to the client.

Let's do as if the connection never broke, then :
we restore the state of A and B as of t0 (which is possible, as we
have the ChangeLog system, allowing us to revert to a previous state).
Of course, we do so on both servers. Now, let's merge the modifications
from A and B :

op[A, 1, t1], op[B, 1, t'1], ..., op[B, j-1, t'j-1], op[B, j, t'j],
op[A, i-1, ti-1], op[B, j+1, t'j+1], op[A, i, ti], op[A, i+1, ti+1],
..., op[A, n-1, tn-1], op[B, m-1, t'm-1], op[B, m, t'm], op[A, n, tn]
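
As both lists are already ordered, building this merged list is a plain
ordered merge. A minimal sketch in Python, assuming an Op tuple with
made-up field names and integer timestamps (this is only an
illustration, not how the ChangeLog actually stores operations) :

```python
import heapq
from collections import namedtuple

# Hypothetical operation record: when it happened, where, and its local rank.
Op = namedtuple("Op", ["timestamp", "server", "seq"])

# Operations logged on each side since t0, each list already ordered.
ops_a = [Op(1, "A", 1), Op(5, "A", 2), Op(6, "A", 3), Op(9, "A", 4)]
ops_b = [Op(2, "B", 1), Op(3, "B", 2), Op(7, "B", 3), Op(8, "B", 4)]

# heapq.merge walks both sorted inputs once and yields one sorted stream.
merged = list(heapq.merge(ops_a, ops_b, key=lambda op: op.timestamp))
```

The merge only relies on the timestamps being unique and comparable, so
both servers compute exactly the same list.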

As the operations might have occurred at different times on each server,
they are interleaved, but in any case, as each operation is supposed to
have a unique timestamp, the resulting list of modifications is still
ordered, and it is the same on both servers.

Now, after having reverted to state t0, we just have to inject the
modifications from the merged list on A and B, rejecting every
modification that results in an error. At the end, A and B will be
perfectly synchronized, without conflicts.
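
Here is a rough sketch of this revert-and-replay step in Python. The
directory is reduced to a dict, and the operation format, the
OperationError class and the replay() helper are assumptions made for
the example, not the real ChangeLog API :

```python
import copy

class OperationError(Exception):
    """Raised when an operation cannot be applied (what a client would get)."""

def apply_op(entries, op):
    kind, dn, value = op
    if kind == "add":
        if dn in entries:
            raise OperationError("entry already exists: " + dn)
        entries[dn] = value
    elif kind == "modify":
        if dn not in entries:
            raise OperationError("no such entry: " + dn)
        entries[dn] = value
    elif kind == "delete":
        if dn not in entries:
            raise OperationError("no such entry: " + dn)
        del entries[dn]
    else:
        raise OperationError("unknown operation: " + kind)

def replay(state_t0, merged_ops):
    """Start again from the t0 state and apply the merged list in order."""
    entries = copy.deepcopy(state_t0)
    rejected = []
    for timestamp, op in merged_ops:
        try:
            apply_op(entries, op)
        except OperationError:
            rejected.append((timestamp, op))   # conflicting operation, dropped
    return entries, rejected

state_t0 = {"cn=alice": "v0"}
merged_ops = [
    (1, ("add", "cn=bob", "from A")),
    (2, ("add", "cn=bob", "from B")),     # conflict: bob already added at 1
    (3, ("delete", "cn=alice", None)),
    (4, ("modify", "cn=alice", "v1")),    # conflict: alice deleted at 3
]
state_a, rejected_a = replay(state_t0, merged_ops)
state_b, rejected_b = replay(state_t0, merged_ops)
```

Both sides start from the same state and apply the same list, so they
reject the same operations (here the ones at 2 and 4) and end up with
the same content.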

Now, remember that A and B are not single servers, but sets of servers.
It doesn't matter too much, as we can consider that all the servers in
set A and set B are totally replicated, so they are in the very same
state, and the merged list can just be applied the same way to any
server from A and B.

What if we have many groups of disconnected servers ? This is a bit more
complex, but not so much. We just have to replicate the groups 2 by 2,
or assume they are replicated 2 by 2, and at the end of a potentially
long process, where we revert back to the time the servers were
disconnected and reapply all the merged modifications, we will be back
in the same state for all the servers.
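
For the merge part at least, going 2 by 2 gives the same list as merging
everything at once. A small sketch in Python, assuming all the groups
split from one common state and that each log is a sorted list of
(timestamp, group) pairs (both the data and the merge_two() helper are
invented for the example) :

```python
import heapq
from functools import reduce

def merge_two(ops_x, ops_y):
    """Merge two already ordered operation lists into one ordered list."""
    return list(heapq.merge(ops_x, ops_y))

# One ordered log per disconnected group.
group_logs = [
    [(1, "A"), (6, "A")],
    [(2, "B"), (5, "B")],
    [(3, "C"), (4, "C")],
]

# Merge A with B, then the result with C, and so on.
pairwise = reduce(merge_two, group_logs)

# Reference result: every operation sorted in one go.
all_at_once = sorted(op for log in group_logs for op in log)
```

This says nothing about the cost of reverting and replaying on every
group, which is where the "potentially long process" comes from.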

There are only two conditions we must meet :
1) the servers must be time synchronized,
2) the modification timestamps must be unique, whatever server they have
been done on.

Condition 2 can easily be met with the existing CSN, if we consider that
there is an order on the replicas (i.e., A < B < C, ... where A, B, C are
the replicas' ids). This is purely conventional, but necessary.
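
To illustrate how a replica order breaks ties, here is a simplified
sketch in Python. The Csn class below is a made-up stand-in with only
three fields, it is not the actual CSN layout used in the server :

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Csn:
    # Fields are compared in this order: time first, then the per-replica
    # change counter, and last the replica id as the conventional tie-breaker.
    timestamp: int
    change_count: int
    replica_id: str

a = Csn(1000, 0, "A")
b = Csn(1000, 0, "B")        # same instant as a, but on replica B
later = Csn(1001, 0, "A")

ordered = sorted([later, b, a])
```

Two operations done at the very same instant on A and B still get
distinct, comparable values, with A's operation coming first because of
the A < B convention.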

Regarding condition #1, we can't guarantee that all the servers will use 
the same time. We just do our best to get this as accurate as possible.

Last, but not least : the triggers. If some modification can trigger
another one (because of integrity constraints being activated), then the
triggered modification should be logged in the change log. When
replicating, the triggers _must_ be disabled, as the merged operations
will already contain all the triggered operations.
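
A toy sketch of why the triggers have to be off during the replay, in
Python. The "delete_user" / "remove_member" operations and the
fire_triggers flag are invented for the example, they are not the
server's trigger mechanism :

```python
def apply(entries, changelog, op, fire_triggers=True):
    """Apply one operation and log it; optionally fire the integrity trigger."""
    changelog.append(op)
    kind, user = op
    if kind == "delete_user":
        entries["users"].discard(user)
        if fire_triggers:
            # Integrity constraint: a deleted user also leaves the group.
            apply(entries, changelog, ("remove_member", user), fire_triggers)
    elif kind == "remove_member":
        entries["group"].discard(user)

# On the origin server the trigger fires, and its effect is logged too.
origin = {"users": {"alice", "bob"}, "group": {"alice", "bob"}}
origin_log = []
apply(origin, origin_log, ("delete_user", "bob"))

# On the replica we replay the log with triggers disabled.
replica = {"users": {"alice", "bob"}, "group": {"alice", "bob"}}
replica_log = []
for op in origin_log:
    apply(replica, replica_log, op, fire_triggers=False)
```

The replica ends up in the same state with the same log. Had the
triggers been left enabled, the removal from the group would have been
applied and logged a second time.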

OK, I'm done now. All this is of course a coarse approximation, but I
think it's pretty close to what we need to deal with.

Please just tell me if I'm not totally off the rails, or if you think I
have just done too much pot lately ;)

Thanks !

-- 
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org