
Mitosis Development Guide has been edited by Martin Alderson (Jan 13, 2009).




Glossary

Term  Description
CSN   Change Sequence Number
MMR   Multi-Master Replication
UUID  Universally Unique IDentifier

Replication analysis

Base operations

Replication is meant to transpose a modification done on one server onto the associated servers. We should also ensure that a modification done on an entry in more than one server does not lead to inconsistencies.

As the remote servers may not be available, due to network conditions, we also have to wait for the synchronization to be done before we can validate a full replication for an entry. For instance, if we delete an entry on server A, it can only be deleted for real when all the remote servers have confirmed that the deletion was successful.

Data structure

CSN and UUID

We will use two tags, stored within each entry, to manage the replication. The CSN (Change Sequence Number) stores when and where (on which server) the entry was last modified. An entry replicated on 3 servers will have the same CSN on all of them; before replication the CSNs may differ. The UUID (Universally Unique Identifier) is associated with one entry, and only one. So if we have an entry replicated on 3 servers, it will have one CSN (as the entry is at the same version on all servers) and only one UUID (as it's the same entry). The UUID is not currently used. The CSN stored in the entry is used to prevent older modifications from overwriting newer ones. Unfortunately this leads to inconsistent servers (see DIRSERVER-894) - we need to check the CSN for each attribute instead. Once this is fixed, the CSN stored on each entry will no longer be used.

CSN structure

A CSN is a composition of a timestamp, a replica ID and an operation sequence number. It's described in The LDAP Change Sequence Number draft. We have defined a simpler version, as the current RFC is still a draft, where we use a single operationSequence instead of two integers (timeCount and changeCount) to disambiguate entries changed at the same time.

As the timestamp is computed using a System.currentTimeMillis() call, the accuracy is around 10 ms. We may have hundreds of changes done in this interval. This is the reason we have an additional operationSequence number.

The CSN class structure is described by the following schema :

Basically, from the user POV, a CSN syntax is [timestamp:replicaId:operationSequence]
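The simplified CSN described above can be sketched in Java. This is a hypothetical illustration : the class name, fields and ordering rules are assumptions, not the actual ADS classes.

```java
// Hypothetical sketch of the simplified CSN : [timestamp:replicaId:operationSequence].
// Names and comparison details are illustrative, not the actual ADS implementation.
final class SimpleCsn implements Comparable<SimpleCsn> {
    private final long timestamp;         // System.currentTimeMillis() at change time
    private final String replicaId;       // ID of the server that made the change
    private final int operationSequence;  // disambiguates changes in the same time slice

    SimpleCsn(long timestamp, String replicaId, int operationSequence) {
        this.timestamp = timestamp;
        this.replicaId = replicaId;
        this.operationSequence = operationSequence;
    }

    // User-visible syntax : [timestamp:replicaId:operationSequence]
    public String toString() {
        return "[" + timestamp + ":" + replicaId + ":" + operationSequence + "]";
    }

    // Order by timestamp first, then operation sequence, then replica ID
    public int compareTo(SimpleCsn o) {
        if (timestamp != o.timestamp) return Long.compare(timestamp, o.timestamp);
        if (operationSequence != o.operationSequence)
            return Integer.compare(operationSequence, o.operationSequence);
        return replicaId.compareTo(o.replicaId);
    }
}
```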

UUID structure

We use the Java 5 UUID implementation, which is based on variant 2 of RFC 4122.
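This can be checked directly with java.util.UUID : randomUUID() produces a Leach-Salz (variant 2, RFC 4122) version 4 UUID.

```java
import java.util.UUID;

class EntryUuidDemo {
    public static void main(String[] args) {
        // randomUUID() yields a Leach-Salz (variant 2, RFC 4122) version 4 UUID
        UUID entryUuid = UUID.randomUUID();
        System.out.println(entryUuid);
        System.out.println("variant = " + entryUuid.variant()); // always 2
        System.out.println("version = " + entryUuid.version()); // always 4
    }
}
```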

Network

As Mitosis is a multi-master replication system, each server has to be connected to the servers it replicates with, and accept incoming connections from those servers.

We have two components :

  • an Acceptor, for incoming replication operations
  • N Connectors, one per connected server.

The biggest problem we have is connecting to the remote servers. As a starting server will have to reconnect to the remote servers, we will have two problems :

  • if the remote server is also starting, but has not yet established its listener, we won't be able to establish the connection
  • if the servers are not time synchronized, we may not be able to correctly replicate a time based operation.

Network initialization

When a server starts, after having initialized the internal LDAP service, it has to start the network layer. The following algorithm is used :

start the Acceptor

set a retry interval to 2 seconds

until each remote server is connected do
  for each disconnected remote replica do
    start a connector

    if the connection is established
      remove it from the list of disconnected servers
  done

  if some remote servers are still disconnected
    wait for the retry interval
    double the retry interval, up to 60 seconds
  else
    exit
done

Basically, we try to connect to a remote server, and if we don't succeed, we wait for an increasing period of time before retrying. When we reach 60 seconds for this interval, we stop increasing it and simply retry every minute.
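The retry policy described above can be sketched as a small helper. This is a hypothetical illustration, not the actual Mitosis code : start at 2 seconds, double on each failed round, cap at 60 seconds.

```java
// Hypothetical sketch of the connector retry policy : the interval starts
// at 2 seconds, doubles after each failed round, and is capped at 60 seconds.
final class RetryScheduler {
    private static final long INITIAL_INTERVAL_MS = 2_000L;
    private static final long MAX_INTERVAL_MS = 60_000L;

    private long intervalMs = INITIAL_INTERVAL_MS;

    // Returns the delay to wait before the next connection attempt,
    // then doubles the interval for the following round.
    long nextDelayMs() {
        long current = intervalMs;
        intervalMs = Math.min(intervalMs * 2, MAX_INTERVAL_MS);
        return current;
    }
}
```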

Obviously, this is costly and fragile, as a broken connection has to be detected and immediately restored, otherwise we can't replicate. Plus we don't manage scheduled downtime : the server keeps trying to connect to the shut down server even if it was stopped on purpose.

Plus we have to store the pending operations until the connection is re-established.


Another approach would be to rely on an asynchronous system (Messages) to handle the server to server communication. The biggest advantage would be to rely on a proven system to manage connections and retries, instead of coding our own system inside ADS, with all the burden it brings. ActiveMQ could be a good option.

Store

We use a Database to store pending operations.

Database structure

We use 3 tables : REPLICATION_METADATA, REPLICATION_UUID and REPLICATION_LOG.


The "REPLICATION_" prefix can be configured, for instance if one wants to define more than one ADS instance locally. It would make more sense to use the replicaId instead of this prefix, though.

REPLICATION_METADATA structure

field    type                   Primary key  description
M_KEY    VARCHAR(30) NOT NULL   Yes
M_VALUE  VARCHAR(100) NOT NULL  No

REPLICATION_UUID structure

field  type               Primary key  description
UUID   CHAR(36) NOT NULL  Yes          The entry UUID
DN     CLOB NOT NULL      No           The entry DN

REPLICATION_LOG structure

field           type                  Primary key  description
CSN_REPLICA_ID  VARCHAR(16) NOT NULL  Yes          The replica ID
CSN_TIMESTAMP   BIGINT NOT NULL       Yes          The timestamp
CSN_OP_SEQ      INTEGER NOT NULL      Yes          The operation sequence
OPERATION       BLOB NOT NULL         No           The replication operation
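As a rough illustration, the three tables could be declared with DDL along these lines. This is a hypothetical, Derby-flavoured sketch; the actual schema used by Mitosis may differ in types and constraints.

```sql
-- Hypothetical DDL matching the tables above (illustrative only).
CREATE TABLE REPLICATION_METADATA (
    M_KEY   VARCHAR(30)  NOT NULL PRIMARY KEY,
    M_VALUE VARCHAR(100) NOT NULL
);

CREATE TABLE REPLICATION_UUID (
    UUID CHAR(36) NOT NULL PRIMARY KEY,  -- the entry UUID
    DN   CLOB     NOT NULL               -- the entry DN
);

CREATE TABLE REPLICATION_LOG (
    CSN_REPLICA_ID VARCHAR(16) NOT NULL, -- the replica ID
    CSN_TIMESTAMP  BIGINT      NOT NULL, -- the timestamp
    CSN_OP_SEQ     INTEGER     NOT NULL, -- the operation sequence
    OPERATION      BLOB        NOT NULL, -- the serialized operation
    PRIMARY KEY (CSN_REPLICA_ID, CSN_TIMESTAMP, CSN_OP_SEQ)
);
```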

Operation storage

Each operation is logged into the REPLICATION_LOG table. It has to be serialized first to be put into the OPERATION field. The CSN is spread over three columns for better searching.

The serialized Operation structure will depend on the operation. In any case, it's a triplet <OpType, CSN, [serialized op]>, where the [serialized op] can be composite. For instance, we may have something like <OpType, CSN, <OpType, CSN, entry> <OpType, CSN, entry>> if we deal with a composite operation.

The AddEntry operation is serialized as <OpType, CSN, Entry>
The XXXAttribute operations are serialized as <OpType, CSN, <dn, id, attribute>>
The composite operations are serialized as <OpType, CSN, list of serialized children>
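A minimal sketch of the <OpType, CSN, [serialized op]> triplet layout, with hypothetical names; the real Mitosis serialization format differs in detail.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of the <OpType, CSN, [serialized op]> triplet layout.
final class OperationCodec {
    static byte[] serialize(int opType, String csn, byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(opType);          // operation type tag
        out.writeUTF(csn);             // CSN in [timestamp:replicaId:opSeq] form
        out.writeInt(payload.length);  // payload length, so composites can nest
        out.write(payload);            // the serialized operation body
        out.flush();
        return bytes.toByteArray();
    }
}
```

For a composite operation, the payload bytes would themselves be a concatenation of child triplets produced the same way.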


The CSN is already a part of the entry for an Add operation, so it's not necessary to serialize it.
The Id is already stored in the Attribute for every attribute operation, so we can avoid serializing it.


We use a specific, Java based serialization to store the operation. It would be way better to store LDIF, assuming we always consider the CSN as a modified attribute.
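For instance, a replicated delete could then travel as a plain LDIF modify. This is a hypothetical example of what such a format could look like; the DN and CSN value are illustrative.

```
dn: cn=example,ou=system
changetype: modify
replace: entryDeleted
entryDeleted: TRUE
-
replace: entryCSN
entryCSN: [1231858680021:instance_a:0]
```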

Configuration

The replication system is a Multi-Master replication, ie, each server can update any server it is connected to. The way you tell a server to replicate to others is simple :

<replicationInterceptor>
  <configuration>
    <replicationConfiguration logMaxAge="5"
                              replicaId="instance_a"
                              replicationInterval="2"
                              responseTimeout="10"
                              serverPort="10390">
      <s:property name="peerReplicas">
        <s:set>
          <s:value>instance_b@localhost:1234</s:value>
          <s:value>instance_c@localhost:1234</s:value>
        </s:set>
      </s:property>
    </replicationConfiguration>
  </configuration>
</replicationInterceptor>

Here, for the server instance_a, we have associated two replicas : instance_b and instance_c. Basically, you just give the list of remote servers you want to be connected to.

The replication interceptor

The MITOSIS service is implemented as an interceptor in the current version (1.5.4). The following operations are handled :

  • add
  • delete
  • hasEntry
  • list
  • lookup
  • modify
  • move
  • moveAndRename
  • rename
  • search

The hasEntry, list, lookup and search operations are only handled to prevent tombstoned (deleted) entries from being returned.

Interceptor initialization

When the interceptor is injected into the chain, its init() method is called, and it will initialize the full replication system. Here are the steps the init() method goes through :

  1. Validate the replication configuration
  2. Initialize the store
  3. Start the CSNFactory
  4. Start the networking sub-system
  5. Purge the aged data from the store

Then the service is ready to process new operations.


The purge of old data is not done atm unless the server is restarted. It has to be completed. We have a Quartz job (ReplicationLogCleanJob) for this but it isn't scheduled by default - see Quartz Scheduler Integration.

Operation classes

We are using Operation objects to manage replications inside the interceptor. Here is the Operation class hierarchy :

Each of the interceptor's methods handling an entry modification will use one of those classes to store the resulting modification.

Operations

Add operation

It creates an AddEntryOperation object, with an ADD_ENTRY operation type (how useful is it, considering that we have already defined a specific class for such an operation ???), an entry and a CSN.

The newly created entry will contain two new AttributeTypes :

  • an entryUUID with a newly generated UUID
  • an entryDeleted set to FALSE

If the added entry already exists in the current server, then we should consider that the entry can't be added.


Currently, we check for more than the existence of the entry in the base. Either the entry is absent, and we can add it, or it's present, and we should discard the new entry, throwing an error.

Another option is to consider that the entry has been created on more than one remote server, and has then been created locally. We may have to replace the old entry with the new one, even if they are different. This is the current implementation.
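The newest-wins rule can be sketched as follows. This is a hypothetical helper comparing the raw CSN fields; in ADS the comparison goes through the CSN class itself, and the replica ID also takes part in tie-breaking.

```java
// Hypothetical sketch of the newest-CSN-wins conflict rule described above.
final class ConflictResolver {
    // Returns true if the incoming change is strictly newer than the local one,
    // comparing the CSN timestamp first, then the operation sequence.
    static boolean shouldReplace(long localTs, int localSeq, long inTs, int inSeq) {
        if (inTs != localTs) return inTs > localTs;
        return inSeq > localSeq;
    }
}
```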


What if the entry already exists, but with a pending 'deleted' state ? This has to be checked.

Delete operatio= n

It creates a CompositeOperation object : the entry is not deleted, but instead a ReplaceAttributeOperation adds an entryDeleted AttributeType to the entry, and a second ReplaceAttributeOperation injects an entryCSN AttributeType with a newly created CSN.

So here is the operation content :

  • ReplaceAttributeOperation : entryDeleted, value TRUE
  • ReplaceAttributeOperation : entryCSN, with a new CSN

As we may receive an Add request from a remote server - per replication activation - we currently create so-called glue entries. These are necessary if we consider that an entry can be added while the underlying tree is absent. This can happen in an MMR scenario where those missing entries have not been received yet, but the leaves have been.


The delete operation should be a simple attribute Modification. Currently, two requests are sent to the backend (one for each added attribute), which is useless.