Mailing-List: contact dev-help@directory.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Apache Directory Developers List" <dev@directory.apache.org>
Received-SPF: neutral (athena.apache.org: local policy)
Message-ID: <47573B92.4030507@planetquake.com>
Date: Thu, 06 Dec 2007 00:00:18 +0000
From: Martin Alderson <equim@planetquake.com>
User-Agent: Thunderbird 2.0.0.0 (Windows/20070326)
MIME-Version: 1.0
To: Apache Directory Developers List <dev@directory.apache.org>
Subject: Re: [ApacheDS][Mitosis] Replication data
References: <47460E73.5030907@planetquake.com>
	 <a32f6b020712011249p4c72f277w44f87f35a026ee28@mail.gmail.com>
	 <4751D277.3070603@symas.com>
	 <a32f6b020712011539s3e73ef77r3b9931b051257e11@mail.gmail.com>
	 <4752003F.6010806@symas.com>	 <16D88941E6869E1D2D68DFC0@deus-ex.zimbra.com>
 <a32f6b020712011827y581a475fwf80377edeca62e25@mail.gmail.com>
In-Reply-To: <a32f6b020712011827y581a475fwf80377edeca62e25@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit


Thanks for the responses, all.

Apologies for the delay in getting back to you - having a family problem 
at the moment so have very little spare time.

I thought storing the replication logs in LDAP sounded nice - for 
new replicas we have to send all replicable entries, but after that the 
log entries (themselves LDAP entries) can be sent instead.  It would be 
pretty much the same code logic, and it seemed to solve all the problems 
with a large amount of code re-use.  I was worried about possible 
performance hits though, and it sounds like you (Alex) don't want to 
store the logs in LDAP for the same reason.

My main reasons for suggesting storing the logs in LDAP are:
1. So we can have optional attributes in each log entry.  This is needed 
when we "explode" the current message blob so it can be queried 
efficiently.  With JDBM I guess we would have to specify a new table for 
each type of message.
2. To reduce the code complexity.  We would have virtually the same code 
for sending whole entries as sending the logs and we would have less 
code for dealing with the data storage in general.
3. To reduce the current tight coupling with the backend database.  By 
using LDAP as the abstraction layer we could leverage ApacheDS' existing 
mechanism for specifying the data store.
4. To allow an easy way to view the logs.
5. It seems to be the most natural fit.  Since we need to store (part 
of) an LDAP entry in the logs, why not store it in LDAP?

I'll take another stab at explaining that: we already have code to store 
LDAP entries in a database, so why would we want to write that again?


 > Oh this reminds me that we also need to make sure we're generating
 > UUIDs all the time even if replication is not enabled.

Yeah, we have a JIRA about this: 
https://issues.apache.org/jira/browse/DIRSERVER-776


 >> The biggest concern I have for this is the inflexibility of LDAP
 >> searches. Do we have a sort control in ApacheDS?
 > What types of searches do you envision performing, for which LDAP
 > is too inflexible? OpenLDAP's syncrepl can be pretty much entirely
 > mapped onto plain search operations. We gain a lot of versatility
 > by keeping things generic.

We need to search for log entries beyond a certain CSN and have the 
results ordered based on CSN.  I guess if the results are always 
returned in creation date order then it might not be an issue (I'm not 
yet sure what ApacheDS does or what the LDAP standard says).  Currently 
we also find the current CSN vector by just getting the most recent log 
- we do this by performing a search with inverted sort by CSN and 1 
result maximum.  Also, if we have the attributes in a child entry of the 
actual log entry as I suggested we would need to specify a parent-child 
relationship in the search.
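
To make those two queries concrete, here is a minimal in-memory sketch 
of what the log store has to support, whatever ends up backing it.  The 
SketchCsn and SketchLogStore names are made up for illustration and are 
not ApacheDS or Mitosis classes; the CSN is assumed to order by 
timestamp, then replica id, then sequence number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical, simplified CSN (assumption: ordered by timestamp,
// then replica id, then per-replica sequence number).
final class SketchCsn implements Comparable<SketchCsn> {
    final long timestamp;
    final String replicaId;
    final int sequence;

    SketchCsn(long timestamp, String replicaId, int sequence) {
        this.timestamp = timestamp;
        this.replicaId = replicaId;
        this.sequence = sequence;
    }

    @Override
    public int compareTo(SketchCsn other) {
        int c = Long.compare(timestamp, other.timestamp);
        if (c != 0) {
            return c;
        }
        c = replicaId.compareTo(other.replicaId);
        if (c != 0) {
            return c;
        }
        return Integer.compare(sequence, other.sequence);
    }
}

// Hypothetical log store keyed and sorted by CSN.
final class SketchLogStore {
    private final NavigableMap<SketchCsn, String> logs = new TreeMap<>();

    void append(SketchCsn csn, String change) {
        logs.put(csn, change);
    }

    // "Log entries beyond a certain CSN", returned in ascending CSN order.
    List<String> changesAfter(SketchCsn from) {
        return new ArrayList<>(logs.tailMap(from, false).values());
    }

    // "Inverted sort by CSN and 1 result maximum": the newest CSN, or null.
    SketchCsn latest() {
        return logs.isEmpty() ? null : logs.lastKey();
    }
}
```

A sorted index on the CSN gives both operations cheaply; with a plain 
LDAP search we would need either a server-side sort or a guaranteed 
result order to get the same behaviour.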


 > Active Directory has a lot of misfeatures... Having spent a couple
 > weeks of "quality time" with it, the flaws just leap out... Do you
 > really like the idea of carrying obsolete info around and needing a
 > sweep task to go thru and clean up periodically?

My thoughts exactly.  I'll try to do some more research here and dig 
out the reason AD uses tombstones, but I think I'll leave that to a 
separate thread.


 > For example you have a delete of a node occur right when you add a
 > child to it.  The server would probably put the child into some
 > lost and found area and alert the administrator.  With tombstoning
 > you can easily resuscitate the deleted parent and move the child
 > back under it.

Resuscitating a deleted entry seems like something most people wouldn't 
want.  If we are attempting to simulate a single server as much as 
possible (which is my main aim) then the new child entry should be 
deleted when the peers synchronise.  As you said, we could have an 
optional lost and found area for cases where conflict resolution causes 
data loss like this, along with optional notifications to an administrator.


 >> Also our MMR support is still immature, we don't yet do value-level
 >> conflict resolution.
 > Yeah we have yet to consider that.

We will have this once I have fixed 
https://issues.apache.org/jira/browse/DIRSERVER-894.


 > The trick to get from basic single-master to basic (entry-level
 > only) multi-master is just to store multiple contextCSNs - one for
 > each peer master, and ignore entry updates that are older than an
 > entry's current entryCSN. The other requirement here is that you
 > have reliable, tightly synced clocks, otherwise the conflict
 > resolution policy falls apart.

That's exactly how our replication module works at the moment, except we 
just send the changes rather than the whole entry.  I am currently 
looking at improving the way we store the logs so we can efficiently do 
conflict resolution at the attribute value level.  I suspect that I will 
end up with something very similar to delta-syncrepl.  I will try to dig 
out some information on that from the OpenLDAP mailing list.
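
For anyone following along, the entry-level rule described in the quote 
can be sketched roughly as below.  This is only an illustration under 
simplifying assumptions: SketchReplica is a made-up class, a CSN is 
reduced to a single long, and one string stands in for a whole entry.  
It is not the Mitosis code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical replica applying entry-level, last-writer-wins updates.
final class SketchReplica {
    // DN -> CSN of the last change applied to that entry (its "entryCSN").
    private final Map<String, Long> entryCsn = new HashMap<>();
    // DN -> current content (a single string stands in for the entry).
    private final Map<String, String> entries = new HashMap<>();
    // Peer replica id -> newest CSN seen from it (one "contextCSN" per master).
    private final Map<String, Long> contextCsn = new HashMap<>();

    // Returns true if the update was applied, false if it was ignored as stale.
    boolean apply(String originReplicaId, long csn, String dn, String value) {
        // Always remember how far we have got with this peer.
        contextCsn.merge(originReplicaId, csn, Long::max);

        // Ignore updates that are not newer than the entry's current entryCSN.
        Long current = entryCsn.get(dn);
        if (current != null && current >= csn) {
            return false;
        }
        entryCsn.put(dn, csn);
        entries.put(dn, value);
        return true;
    }

    String get(String dn) {
        return entries.get(dn);
    }

    Long contextCsnFor(String replicaId) {
        return contextCsn.get(replicaId);
    }
}
```

Because the whole entry wins or loses, a stale update is dropped even if 
it touched a different attribute, which is why the logs need to carry 
per-attribute information before we can resolve conflicts at the value 
level.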

Martin