directory-dev mailing list archives

From "Zheng, Kai" <kai.zh...@intel.com>
Subject RE: JdbmPartition repair
Date Mon, 25 Jan 2016 14:27:33 GMT
Thanks a lot for the detailed and insightful explanation. I'm not able to absorb it all, as I'm
not familiar with the code, but it will serve as very good material when someday I need to look
into the LDAP side of things. The details make me believe it's very necessary to have a strong,
mature, industry-proven backend for the LDAP server, because the LDAP logic is already complex
enough. We can't combine the LDAP logic with the storage engine; they need to be separated, and
developed and tested separately. It looks like Mavibot is going this way, and that sounds good to
me. What concerns me is that, as we lack resources for developing it, it may still
take some time to become mature and robust. But if we leverage an existing engine, then
we can focus on the LDAP work, take on some advanced features, move a little faster,
and have releases like 2.x, 3.x and so on. Sqlite is C, yes, but it's supported on many platforms,
and Java can use it via JNI; it's a library, so it can be embedded in an application. You may dislike
JNI, but only a few APIs would need to be wrapped for this usage, and there are already
wonderful wrappers for Java. As with snappy-java, the JNI layer along with the native library can be
bundled within a jar file and distributed exactly like a Maven module. One thing I'm not
sure about is how well LDAP entries fit the SQL table model, but I guess there is room for quite a
bit of investigation in this direction. The benefit would be saving us a lot of development
and debugging time, plus robustness, high performance, transaction support and easy querying. Some
thoughts, in case any of them help. Thanks.
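
For illustration only, a minimal sketch of what that could look like, assuming the xerial
sqlite-jdbc wrapper (which bundles the native library inside the jar, the same way snappy-java
does); the table layout is just a guess at how the master table could be mapped:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.UUID;

public class SqliteBackendSketch {
    public static void main(String[] args) throws Exception {
        // sqlite-jdbc extracts and loads its bundled native library transparently
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:partition.db")) {
            try (Statement stmt = conn.createStatement()) {
                // Hypothetical master table: one row per entry, keyed by UUID,
                // with the serialized entry stored as a BLOB
                stmt.execute("CREATE TABLE IF NOT EXISTS master"
                        + " (uuid TEXT PRIMARY KEY, entry BLOB)");
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO master (uuid, entry) VALUES (?, ?)")) {
                ps.setString(1, UUID.randomUUID().toString());
                ps.setBytes(2, new byte[] {}); // serialized entry bytes would go here
                ps.executeUpdate();
            }
        }
    }
}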

Regards,
Kai

-----Original Message-----
From: Emmanuel Lécharny [mailto:elecharny@gmail.com] 
Sent: Monday, January 25, 2016 1:32 AM
To: Apache Directory Developers List <dev@directory.apache.org>
Subject: Re: JdbmPartition repair

On 24/01/16 16:47, Zheng, Kai wrote:
> Thanks Emmanuel for the sync and the sharing. The approach looks pretty good, and is often
> seen in many mature products. Is the repair process triggered when corruption is found while
> the server is running, or while restarting with a specific option? Or both? If the repair
> logic is not easy to integrate, maybe a standalone repair tool like the one Kiran worked out
> would be good to use? Or, during startup, check the data and indexes, and if something is
> wrong, launch the tool as a separate process to do the fixing? Just some thoughts, in case
> they are useful.

The corruption happens in some rare cases, and it's mostly due to concurrent updates.
Let me explain what happens in detail, and sorry if it's a bit lengthy, it has to be.

We store entries in what we call the MasterTable. Entries are serialized, and each one of
them has an associated ID (actually, we are using a random UUID). So the master table contains
tuples of <UUID, Entry>.
Each index refers to this MasterTable using the entry UUID. Typically, let's say an entry
has an ObjectClass person : then the ObjectClass index will have a tuple <ObjectClass, Set<UUID>>
where the set contains the UUIDs of all the entries that have the 'person' ObjectClass.
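
To picture the shapes involved, a rough in-memory sketch (the real partition persists these
structures on disk; the class and field names are illustrative, not the actual ApacheDS types):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

class MasterTableSketch {
    // MasterTable : <UUID, serialized Entry>
    Map<UUID, byte[]> master = new HashMap<>();

    // ObjectClass index : <value, Set<UUID>>,
    // e.g. "person" -> the IDs of all entries with the 'person' ObjectClass
    Map<String, Set<UUID>> objectClassIndex = new HashMap<>();

    void add(UUID id, byte[] serializedEntry, Set<String> objectClasses) {
        master.put(id, serializedEntry);
        for (String oc : objectClasses) {
            objectClassIndex.computeIfAbsent(oc, k -> new HashSet<>()).add(id);
        }
    }
}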

We also have one special index, the Rdn index. This one is more complex, because it is used
for two things : referring to an entry from an RDN, and keeping track of the hierarchy.
If we have an entry whose DN is ou=users,dc=example,dc=com, where dc=example,dc=com is the
partition's root, then the RDN index will contain two tuples for the full DN : one for the
entry itself, and one for the suffix. Actually, we don't store tuples like <Rdn, ID>,
but a more complex structure, the ParentIdAndRdn.
The reason is that we may have many entries using the same RDN. For instance :

entry 1 : cn=jSmith,ou=users,dc=example,dc=com
entry 2 : cn=jSmith,ou=administrators,dc=example,dc=com

Whether this jSmith is one person or two is irrelevant. The point is that we can't associate
the RDN cn=jSmith with one single entry, so what we store is a tuple <entry1 ID, <users ID,
cn=jSmith>> for the first entry, and <entry2 ID, <admin ID, cn=jSmith>> for the second entry.
The difference is that we use the parent's ID of each entry as a discriminant. For those two
entries, and their respective parents, we will have the following tuples stored in the reverse
RDN index :

<root ID, <null, dc=example,dc=com>>     : for the 'dc=example,dc=com' root entry
<users ID, <root ID, ou=users>>          : for the 'ou=users,dc=example,dc=com' entry
<admin ID, <root ID, ou=administrators>> : for the 'ou=administrators,dc=example,dc=com' entry
<entry1 ID, <users ID, cn=jSmith>>       : for the 'cn=jSmith,ou=users,dc=example,dc=com' entry
<entry2 ID, <admin ID, cn=jSmith>>       : for the 'cn=jSmith,ou=administrators,dc=example,dc=com' entry

We also have a forward index, that contains <ParentIdAndRdn, ID> tuples, like :

<<null, dc=example,dc=com>, root ID>
<<root ID, ou=users>, users ID>
<<root ID, ou=administrators>, admin ID>
<<users ID, cn=jSmith>, entry1 ID>
<<admin ID, cn=jSmith>, entry2 ID>

That is a second index we need to update for each added entry.
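
To make the two directions concrete, a minimal sketch (again, illustrative names, not the
actual ApacheDS classes):

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// The composite key : the parent entry's ID plus the entry's own RDN.
// The partition root uses a null parent ID.
record ParentIdAndRdn(UUID parentId, String rdn) {}

class RdnIndexSketch {
    // forward : <ParentIdAndRdn, ID>, resolves a DN to an entry ID
    Map<ParentIdAndRdn, UUID> forward = new HashMap<>();
    // reverse : <ID, ParentIdAndRdn>, rebuilds a DN from an entry ID
    Map<UUID, ParentIdAndRdn> reverse = new HashMap<>();
}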

Why do we have the reverse index ? You have already realized that we don't store the DN in the
entry, nor do we store the full DN in the RDN index.
That means we have to recreate this DN using the RDN index. For entry 2, assuming we have its
ID, we must use the reverse index to do so. With the entry2 ID, we pull the <entry2 ID,
<admin ID, cn=jSmith>> tuple, which gives us the entry's parent, admin ID. We can now
fetch the admin entry's parent from the <admin ID, <root ID, ou=administrators>> tuple with
another fetch from the index, and finally, with the <root ID, <null, dc=example,dc=com>>
tuple, we get the last RDNs. We can now reconstruct the full DN.
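
In code, that walk looks roughly like this (building on the RdnIndexSketch above):

// Rebuild a DN from an entry ID by walking the reverse RDN index
// up to the root (whose ParentIdAndRdn has a null parent ID).
static String buildDn(RdnIndexSketch idx, UUID entryId) {
    StringBuilder dn = new StringBuilder();
    UUID current = entryId;
    while (current != null) {
        ParentIdAndRdn pir = idx.reverse.get(current);
        if (dn.length() > 0) {
            dn.append(',');
        }
        dn.append(pir.rdn()); // "cn=jSmith", then "ou=administrators", ...
        current = pir.parentId();
    }
    return dn.toString(); // "cn=jSmith,ou=administrators,dc=example,dc=com"
}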

Why do we have the forward index then ? Because we need to be able to fetch an entry from
a DN when we are processing a search (because the BaseDN is provided). We process such a
search by reading the DN, splitting it into RDNs, and for each RDN, from right to left, we
search for the entry ID. Let's say we search for cn=jSmith,ou=administrators,dc=example,dc=com :
we first look in the RDN forward index with the <null, dc=example,dc=com> key, which gives us
the root ID; then we look for <root ID, ou=administrators> in the index, and get back
the admin ID; last but not least, we look for <admin ID, cn=jSmith> and we get what we
were looking for, the entry2 ID.
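
The forward walk is the mirror image (same sketch types; the real code works on parsed Dn/Rdn
objects, not raw strings):

// Resolve a DN to an entry ID : look up each RDN from right to left,
// using the previously found ID as the parent part of the key.
static UUID lookup(RdnIndexSketch idx, String[] rdnsRightToLeft) {
    UUID id = null; // the suffix entry has a null parent
    for (String rdn : rdnsRightToLeft) {
        id = idx.forward.get(new ParentIdAndRdn(id, rdn));
        if (id == null) {
            return null; // no such entry
        }
    }
    return id;
}
// lookup(idx, new String[] { "dc=example,dc=com", "ou=administrators", "cn=jSmith" })
// returns the entry2 ID.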

As one can see, there are 3 accesses to the Rdn index every time we do a search, and 3 more
accesses when we construct the DN for the found entry (for such a DN; if the DN size is 5,
then we will do 5 accesses plus 5 accesses).

That also means we update the RdnIndex many times, times 2 (as we have a reverse and a forward
index). Many times ? You may object that we only have to add the RDN, as the parent's RDN
always exists, and you would be right. Except that we also update the count in all the parents.
The idea is that each entry keeps track of the number of its direct children and the count
of all their descendants. This is critical for searches that involve SUBTREE or ONELEVEL scope, as
it helps keep the search using the best possible index, depending on the number of candidates
we have.

So, bottom line, every update operation (add, delete, move) will update the RDN index DN.size()
* 2 times (the Rename operation will just update the Rdn index twice).
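
Schematically, the counter maintenance that makes updates expensive looks like this (building
on the earlier RdnIndexSketch; keeping the counts in a side map is a simplification for the
sketch, not how the server stores them):

// On an add, every ancestor on the path to the root gets its descendant
// count bumped, which is why the cost grows with DN.size().
static void bumpAncestorCounts(RdnIndexSketch idx,
        Map<UUID, Integer> descendantCount, UUID newEntryId) {
    UUID ancestor = idx.reverse.get(newEntryId).parentId();
    while (ancestor != null) {
        descendantCount.merge(ancestor, 1, Integer::sum);
        ancestor = idx.reverse.get(ancestor).parentId();
    }
}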


That being said, all those operations have to be done as an atomic operation : i.e., either
all the updates are done, or none is. If we imagine that a crash occurs in the middle
of the RDN index update, we will never be able to reconstruct the full DN, as the RdnIndex
will not be complete.

We also have another constraint : we can't allow concurrent updates, as each update
*will* touch the same Rdn index, and each has to see a fully consistent database. We
serialize the updates for this reason (i.e., you can only do a write once any other write is
done).

Let's go a bit deeper : we can't do a search while a write is being processed either, because
during a write, we are reshuffling the indexes, and the searches *are* using those same indexes.

To be clear, we prioritize the reads over the writes, but when a write arrives, we put it
on hold until all the pending reads are completed, and all the newer reads or writes are also
put on hold. Then we execute the write, and now we can process the operations that were on
hold.
Reads can be processed concurrently, though.
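
That policy is close to what a fair read/write lock provides. A minimal sketch, assuming a
plain ReentrantReadWriteLock (the actual server code may differ):

import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

class PartitionLockSketch {
    // 'true' = fair mode : a waiting writer blocks newer readers,
    // but the already-running reads complete first
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    <T> T read(Supplier<T> search) {
        lock.readLock().lock();
        try {
            return search.get(); // many reads can run concurrently
        } finally {
            lock.readLock().unlock();
        }
    }

    void write(Runnable update) {
        lock.writeLock().lock(); // writes are serialized
        try {
            update.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}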


Now, we also update many other indexes : presence, alias, oneAlias, objectClass, entryCSN,
entryUUID. That's for the mandatory indexes. We also have all the user indexes (generally a few,
but there can be tens of them).
A crash (or brutal shutdown) in the middle of an update on any of those indexes can leave the
database in a dire state.


Bottom line, Mavibot being an MVCC database, we can't corrupt it even if we brutally stop the
server, because the intermediate state will never be visible. We always see a fully consistent
state, whatever happens.

Now, back to your question : can we detect a data corruption at startup ? Not easily. It would
be costly, as we would have to read the whole database and check all the indexes. Better to let
the user run a repair on demand as a workaround.
>
> I'm not very sure about rewriting JDBM, though I know there are plenty of reasons to do so,
> as with most software rewrites. But if we start anew, implementing something like a B+ tree
> with transaction support, I'm wondering if we could do it by leveraging an already
> industry-proven backend, because developing such a backend may take a long time and a lot of
> resources. I'm wondering if Sqlite could serve the purpose well or not, or how it could be
> wrapped or adapted for the usage here. Again, just a quick thought, in case it's somewhat
> useful.

We have thought about our options. SQLite is a C-based database. It does not fit us, because
we want to offer an embeddable LDAP server, so we need a Java-based backend.

To be frank, the very first time we discussed getting rid of JDBM was back in 2006
(yes, 10 years ago...) during the Apache conference in Austin. Back then, CouchDB was *exactly*
what we needed, except that it was Erlang-based. Then we were stuck in more urgent fixes.
In 2009, during the LDAP conference, I had proof that our implementation was borked, and
I added very heavy synchronisation to allow the LDAP server to accept concurrent - sort of
- reads and writes. In 2011, we started modifying JDBM so that it would include some MVCC
mechanisms. We used a write-ahead log solution, which proved to be very brittle.
Typically, if the API user forgot to close a cursor, the server would end up stuck
in a deadlock. Finally, we decided to go with Mavibot back in 2012, as it was, first, the right
solution for us, and second, it would also be much faster, as we would not have to deal with
locks when reading.

The first Mavibot version works. We tested it in the server. The second one, the one we are
currently working on, has to support transactions.
The big difference is that with transaction support, we can get rid of the locks we have put
in the server, making the updates faster.

All in all, it's just a matter of how much time we can dedicate to working on Mavibot.

