directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel L├ęcharny <>
Subject Current status of the server, and mvbt-branch status
Date Wed, 19 Sep 2012 14:34:15 GMT
Last year in september, I detected a problem when conducting concurrent 
searches and modifications [1]. This problem is a real show-stopper.

The problem was that as we were fetching the entries one by one, using 
indexes, if some modification occurs and impacts the index we are using 
while the search is being processed, we may not be able to get back the 
values we are pointing to in the search. This results to NPE, at best.

We looked for ways to solve this issue, and a few of them are available.
1) The first solution would be to implement a in-memory MVCC system, 
with a write-ahead log. The idea is to keep in memory all the modified 
values until all the pending searches are completed. They can then be 
removed from memory. This solution would be cross-partition.
2) The second solution would be to use MVCC backends. As we don't remove 
any element unless they are not used anymore, this is a guarantee that 
we will always be able to fetch an element from the backend, even if it 
has been modified. We just will get back an old revision.
3) A third solution, which does not imply that we implement a MVCC 
global system, or a MVCC backend, would be to serialize writes and 
reads. Reads would be concurrent, and writes can only be done when all 
the pending reads and writes are completed. Of course, if a read is 
received while a write is being processed, this read will have to wait 
for the write completion. In order to keep performance good, we need to 
speed up the reads so that writes are not delayed for ever : this is 
done by computer the set of candidates immediately, before fetching the 
entries one by one later. If an entry has been removed whil we fetch it, 
then this entry will simply be ignored.

Those three solutions have pros and cons. We will analyze them in a 
further paragraph.

I wanted to play around the 2nd idea, and as I needed a in-memory MVCC 
BTree for that, I wrote a prototype BTree for that purpose (Mavibot [2]).

While I thought Mavibot was ready (more or less), I created a new branch 
(apacheds-mvbt) to experiment the new backend. I had to do some cleanup 
in order to be able to use the Mavibot implementation :

o Use UUID instead of Long for Entry's identifiers
o Simplify the cursor implementation, and use of generic [3]

After a few days, I found that some technical choice we have made are 
problematic, and I went a bit further.

1) The cursor Interface suppose you can move forward and backward. 
Sadly, when you have a search using a filter with an 'OR' connector, 
there is no way you can move backward. Not that it's really a problem in 
a standard usage of LDAP, but that would defeat the VLV control.

2) Not fetching entries immediately would potentially save time, if the 
entry is not a valid candidate against the filter we use. However, using 
indexes to check if an entry is valid or not mans we go down a BTree, 
which is also a costly operation. If we have a filter with N indexed 
attributes, we will go down N BTrees, for each candidate, instead of 
fetching the entry once, and simply validate it against the filters 
using comparators.

3) We were using generics extensively, in a way that makes it extremely 
confusing to know what is what (having 3 parameters for an IndexCursor 
does not help at all)

4) We were using reverse table for every index, even when it was useless.

5) The Index initialization was problematic

6) Once we are using UUID instead of Long, there is no reason to ue the 
entryUUID index anymore, we can do a full scan using the MasterTable 

7) The Cursor hierarchy was not fully consistent

8) Entry were fetched more than once in complex filters, slowing down 
badly the procssing of such searches

9) Some evaluators could be removed sparing some CPU when the number of 
candidate they were selecting was 0

10) Some Or and And cursor could be removed if they were having only one 

All those items have been fixed in the mvbt-branch.

Then, before implementing Mavibot, I found that we could implement the 
3rd strategy easily. The idea is to allow concurrent searches, but 
writes would be serialized.

The problem was that with the way we were using cursors, we may browse a 
BTree for a very long time - especially when processing a PagedSearch - 
and we can't expect the writes to wait until such a search is completed. 
Another solution was to gather all the candidates *before* fetching any 
entry, and to store their associated UUID into a list of candidates.

This is not any different from how we were conducting searches with a OR 
in the filter : it also gathers all the candidate UUID up to the end of 
the search.

Doing so offers many advantage :
- we can exit the search faster, with all the candidates already 
selected (even for a paged search), and the writes can be applied more 
- we speedup searches where more than one filter can be used to select 
the valid candidates
- now that the set has been created, we can move forward *and* backward, 
something we can't do wth the current implementation when we use an 
OR/AND connector in the filter.

There are some cons though :
- gathering candiates UUID may eat memory
- in some case, it may be slower than the current implementation, just 
because the number of selected candidate may be way higher than the 
number of entries we will return (we pay the cost of extra entry fetch 
in this case)

But even then, we already have to keep a set of candidate UUIDs when 
doing a search with an OR/AND connector in the current implementation, 
and it' snot proven that fetching an entry is more costly than fetching 
a <key/value> of a given index - not to mention that the math may change 
a lot of we have more than one Node in the filter-

Now, in order to fix the concurrent issue we have found last year, we 
have to add a couple of ReadWriteLock :
o one in the OperationManager, to forbid writes to be executed while 
some searches are handling a ReadLock and to forbid a search to be 
executed while a write handles a WriteLock
o another one in the AbstractBtreePartition to protect the master table 
from concurrent access, as the searches will directly fetch the entries 
from the MasterTable - and the RdnIndex -, even when some writes are 
updating them.

The current branch implements this algorithm. It works, and it's fast. I 
have tested the concurrent modifications and searches that demonstrated 
the issue last year, I was able to run 10 000 loops over 20 threads, 
each one doing :
- add cn=test<N>
- search subtree cn=test*
- delete cn= test<N>
where N is the thread number.

This test was badly failing on trunk (the more threads we have, the more 
errors we get), and is now running just fine.

(The test was failing because while doing the search, we were fetching 
entries from the backend, when some deletion were being done at the same 

One more thing : performances. The branch is actually faster than the 
current trunk. A search conducted using the SearchPerfIT test.

Here are the performances I get when using the embedded LDAP server :

1.5.0 (FTR) :
- Object   Search ->  32 993/s
- OneLevel Search -> 156 935/s
- SubLevel Search -> 137 225/s

Trunk :
- Object   Search ->  57 191/s
- OneLevel Search ->  76 250/s
- SubLevel Search ->  77 200/s

MVBT branch :
- Object   Search ->  76 540/s (+33%/trunk)
- OneLevel Search -> 109 985/s (+44%/trunk)
- SubLevel Search -> 134 750/s (+74%/trunk)

We get back one entry for the ObjectSearch, 5 for the OneLevel search 
and 10 for teh SubLevel search. What we measure is the number of entries 
we get back per second.

The 1.5.0 version is actually faster for OneLevel searches and SubLevel 
searches because we were storing the DN in the entry - rebuilding the Dn 
is a costly operation -.

The gain is ever higher when we use a complex filter, due to the 
numerous and spurious fetches of the same entry in trunk.

The same tests but using the network :
MVBT branch :
Object :  2 564/s
One    : 17 960/s
Sub    : 23 050/s

Trunk :
Object :  2 567/s
One    : 12 555/s
Sub    : 21 240/s

Here, the gain i sless significant, mainly because we have the client 
and the server on the same machine, and the messages encoding/decoding 
is shadowing the gain in the backend. But still, we see some gap in the 
oneLevel and subLevel searches (between 8% up to 43%).

At this point, I think we have a working server, solving a critical 
issue we are fighting with for more than one year now, and it's actually 
faster than the current trunk. It also solves some other issues.

Before continuing with the Mavibot backend implementation - which will 
brings some major performance boosts, AFAICT - I could suggest we 
promote the current mvbt branch as the trunk, and cut a 2.0.0-M8 release 
asap.  I do think that it's a better, more stable, and faster working 
base for the upcoming versions, and mor eimportant, the 2.0 GA.

We shoudl also continue the ongoing work on other solutions, as they 
offer some extra advantages, but we may have them in a 2.1 or a later 

so, wdyt ?

Emmanuel L├ęcharny

View raw message