directory-dev mailing list archives

From Pierre-Arnaud Marcelot ...@marcelot.net>
Subject Re: Current status of the server, and mvbt-branch status
Date Fri, 21 Sep 2012 10:05:16 GMT
Very interesting improvements.

Love the increase in performance and simplification in class hierarchy.

+1 for the merge.

Regards,
Pierre-Arnaud


On 19 Sept 2012, at 16:34, Emmanuel Lécharny <elecharny@gmail.com> wrote:

> Last year in September, I detected a problem when conducting concurrent searches and modifications [1]. This problem is a real show-stopper.
> 
> The problem was that, as we were fetching the entries one by one using indexes, if a modification occurred and impacted the index we were using while the search was being processed, we might not be able to get back the values the search was pointing to. This results in an NPE, at best.
> 
> We looked for ways to solve this issue, and a few options are available:
> 1) The first solution would be to implement an in-memory MVCC system, with a write-ahead log. The idea is to keep in memory all the modified values until all the pending searches are completed; they can then be removed from memory. This solution would be cross-partition.
> 2) The second solution would be to use MVCC backends. As elements are never removed until they are no longer used, we are guaranteed to always be able to fetch an element from the backend, even if it has been modified; we will simply get back an old revision (see the sketch after this list).
> 3) A third solution, which does not imply that we implement a global MVCC system or an MVCC backend, would be to serialize writes and reads. Reads would be concurrent, and a write can only be done when all the pending reads and writes are completed. Of course, if a read is received while a write is being processed, this read will have to wait for the write to complete. In order to keep performance good, we need to speed up the reads so that writes are not delayed forever: this is done by computing the set of candidates immediately, before fetching the entries one by one later. If an entry has been removed by the time we fetch it, this entry will simply be ignored.
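> 
> Here is a minimal sketch of the MVCC backend idea from point 2 (this is not Mavibot's actual code; the class and the revision bookkeeping are made up for illustration). Writers create a new revision on every update and never overwrite old values, so a reader that started at an earlier revision can always get back the value it was pointing to:
> 
> import java.util.Map;
> import java.util.Optional;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentSkipListMap;
> import java.util.concurrent.atomic.AtomicLong;
> 
> // Illustration only: each write creates a new revision, deletes are
> // tombstones, and readers pinned to an older revision still see the
> // value that existed when their search started. A real backend would
> // also reclaim old revisions once no pending search needs them.
> public class VersionedStore<K, V> {
> 
>     private final AtomicLong currentRevision = new AtomicLong(0);
> 
>     // For each key, keep every revision of its value (empty = deleted).
>     private final ConcurrentHashMap<K, ConcurrentSkipListMap<Long, Optional<V>>> data =
>             new ConcurrentHashMap<>();
> 
>     /** A reader records the revision it starts from. */
>     public long beginRead() {
>         return currentRevision.get();
>     }
> 
>     /** Each write produces a new revision; old values are kept. */
>     public void put(K key, V value) {
>         long revision = currentRevision.incrementAndGet();
>         data.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>())
>             .put(revision, Optional.of(value));
>     }
> 
>     /** A delete is just a tombstone: older revisions remain readable. */
>     public void delete(K key) {
>         long revision = currentRevision.incrementAndGet();
>         data.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>())
>             .put(revision, Optional.empty());
>     }
> 
>     /** Fetch the newest value whose revision is <= the reader's revision. */
>     public Optional<V> get(K key, long readRevision) {
>         ConcurrentSkipListMap<Long, Optional<V>> versions = data.get(key);
>         if (versions == null) {
>             return Optional.empty();
>         }
>         Map.Entry<Long, Optional<V>> found = versions.floorEntry(readRevision);
>         return (found == null) ? Optional.empty() : found.getValue();
>     }
> }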
> 
> Those three solutions have pros and cons; we will analyze them further below.
> 
> I wanted to play around with the 2nd idea, and as I needed an in-memory MVCC BTree for that, I wrote a prototype BTree for this purpose (Mavibot [2]).
> 
> Once I thought Mavibot was (more or less) ready, I created a new branch (apacheds-mvbt) to experiment with the new backend. I had to do some cleanup in order to be able to use the Mavibot implementation:
> 
> o Use UUID instead of Long for entry identifiers
> o Simplify the cursor implementation and the use of generics [3]
> 
> 
> After a few days, I found that some technical choices we had made were problematic, and I went a bit further.
> 
> 1) The Cursor interface supposes you can move forward and backward. Sadly, when you have a search using a filter with an 'OR' connector, there is no way to move backward. Not that it's really a problem in standard LDAP usage, but it would defeat the VLV control.
> 
> 2) Not fetching entries immediately would potentially save time if the entry is not a valid candidate against the filter we use. However, using indexes to check whether an entry is valid means going down a BTree, which is also a costly operation. If we have a filter with N indexed attributes, we will go down N BTrees for each candidate, instead of fetching the entry once and simply validating it against the filter using comparators (see the sketch after this list).
> 
> 3) We were using generics extensively, in a way that made it extremely confusing to know what is what (having 3 type parameters for an IndexCursor does not help at all)
> 
> 4) We were using a reverse table for every index, even when it was useless.
> 
> 5) The Index initialization was problematic
> 
> 6) Once we are using UUID instead of Long, there is no reason to use the entryUUID index anymore; we can do a full scan using the MasterTable directly
> 
> 7) The Cursor hierarchy was not fully consistent
> 
> 8) Entries were fetched more than once for complex filters, badly slowing down the processing of such searches
> 
> 9) Some evaluators could be removed, sparing some CPU, when the number of candidates they were selecting was 0
> 
> 10) Some Or and And cursors could be removed when they had only one candidate
> 
> All those items have been fixed in the mvbt-branch.
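> 
> To illustrate point 2, here is a minimal sketch of validating a fetched entry in memory, instead of going down one index BTree per filter attribute for every candidate (the entry representation and the assertion map are simplified placeholders, not the real ApacheDS types):
> 
> import java.util.Map;
> import java.util.Set;
> 
> public class CandidateValidation {
> 
>     /**
>      * One entry fetch, then N cheap in-memory comparisons: for each
>      * (attribute, required value) pair of an equality filter, check that
>      * the fetched entry contains the value, instead of descending one
>      * index BTree per attribute for every candidate.
>      */
>     public static boolean matchesAll(Map<String, Set<String>> entry,
>                                      Map<String, String> equalityAssertions) {
>         for (Map.Entry<String, String> assertion : equalityAssertions.entrySet()) {
>             Set<String> values = entry.get(assertion.getKey());
>             if (values == null || !values.contains(assertion.getValue())) {
>                 return false; // rejected without any index lookup
>             }
>         }
>         return true;
>     }
> }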
> 
> Then, before implementing the Mavibot backend, I found that we could implement the 3rd strategy easily. The idea is to allow concurrent searches, while writes are serialized.
> 
> The problem was that, with the way we were using cursors, we might browse a BTree for a very long time (especially when processing a PagedSearch), and we can't expect the writes to wait until such a search is completed. The alternative was to gather all the candidates *before* fetching any entry, and to store their associated UUIDs in a list of candidates.
> 
> This is no different from how we were already conducting searches with an OR in the filter: there too, all the candidate UUIDs are gathered before the end of the search.
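> 
> A minimal sketch of this two-phase search (MasterTable and Entry are placeholders for the real ApacheDS types; the method names are made up for the example):
> 
> import java.util.ArrayList;
> import java.util.LinkedHashSet;
> import java.util.List;
> import java.util.Set;
> import java.util.UUID;
> 
> public class CandidateSetSearch {
> 
>     public interface Entry {}
> 
>     public interface MasterTable {
>         /** Returns null when the entry has been deleted in the meantime. */
>         Entry fetch(UUID id);
>     }
> 
>     public List<Entry> search(Iterable<UUID> indexCandidates, MasterTable master) {
>         // Phase 1: collect every candidate UUID up front, while the index
>         // cursors are open. Once this set is built, no cursor needs to stay
>         // open on a BTree, so a pending write only has to wait for this step.
>         Set<UUID> candidates = new LinkedHashSet<>();
>         for (UUID id : indexCandidates) {
>             candidates.add(id);
>         }
> 
>         // Phase 2: fetch the entries one by one from the MasterTable. A
>         // candidate that was deleted after the set was built is simply ignored.
>         List<Entry> results = new ArrayList<>();
>         for (UUID id : candidates) {
>             Entry entry = master.fetch(id);
>             if (entry != null) {
>                 results.add(entry);
>             }
>         }
>         return results;
>     }
> }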
> 
> Doing so offers many advantages:
> - we can exit the search faster, with all the candidates already selected (even for a paged search), so the writes can be applied more frequently
> - we speed up searches where more than one filter node can be used to select the valid candidates
> - now that the set has been created, we can move forward *and* backward, something we can't do with the current implementation when we use an OR/AND connector in the filter.
> 
> There are some cons though:
> - gathering candidate UUIDs may eat memory
> - in some cases, it may be slower than the current implementation, simply because the number of selected candidates may be way higher than the number of entries we will return (we pay the cost of extra entry fetches in this case)
> 
> But even then, we already have to keep a set of candidate UUIDs when doing a search with an OR/AND connector in the current implementation, and it's not proven that fetching an entry is more costly than fetching a <key/value> pair from a given index - not to mention that the math may change a lot if we have more than one node in the filter.
> 
> Now, in order to fix the concurrency issue we found last year, we have to add a couple of ReadWriteLocks (a minimal sketch follows this list):
> o one in the OperationManager, to prevent writes from being executed while searches hold the ReadLock, and to prevent a search from being executed while a write holds the WriteLock
> o another one in the AbstractBtreePartition, to protect the master table from concurrent access, as the searches will fetch the entries directly from the MasterTable (and the RdnIndex), even while some writes are updating them.
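> 
> Here is a rough sketch of that locking scheme (the class and method names are made up for the example, not copied from the branch; the real locks live in the OperationManager and the AbstractBtreePartition):
> 
> import java.util.concurrent.locks.ReadWriteLock;
> import java.util.concurrent.locks.ReentrantReadWriteLock;
> import java.util.function.Supplier;
> 
> public class OperationSerialization {
> 
>     // Lock #1 (OperationManager level): searches share the read lock while
>     // they select their candidates, writes take the exclusive write lock.
>     private final ReadWriteLock operationLock = new ReentrantReadWriteLock();
> 
>     // Lock #2 (partition level): protects the MasterTable and the RdnIndex,
>     // since entries are fetched from them directly, possibly after the
>     // candidate set has been built (e.g. for a paged search).
>     private final ReadWriteLock masterTableLock = new ReentrantReadWriteLock();
> 
>     public void search(Runnable selectCandidates) {
>         operationLock.readLock().lock();
>         try {
>             selectCandidates.run();
>         } finally {
>             operationLock.readLock().unlock();
>         }
>     }
> 
>     public void write(Runnable updatePartition) {
>         operationLock.writeLock().lock();
>         try {
>             masterTableLock.writeLock().lock();
>             try {
>                 updatePartition.run();
>             } finally {
>                 masterTableLock.writeLock().unlock();
>             }
>         } finally {
>             operationLock.writeLock().unlock();
>         }
>     }
> 
>     /** Used while fetching one entry from the MasterTable. */
>     public <T> T fetchEntry(Supplier<T> fetch) {
>         masterTableLock.readLock().lock();
>         try {
>             return fetch.get();
>         } finally {
>             masterTableLock.readLock().unlock();
>         }
>     }
> }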
> 
> The current branch implements this algorithm. It works, and it's fast. I have tested the concurrent modifications and searches that demonstrated the issue last year: I was able to run 10,000 loops over 20 threads, each one doing:
> - add cn=test<N>
> - search subtree cn=test*
> - delete cn=test<N>
> where N is the thread number.
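> 
> In code, the test loop looks roughly like this (LdapOps stands in for the real connection API, and the ou=system base DN is only an example; what matters is the thread structure):
> 
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> 
> public class ConcurrentSearchModifyTest {
> 
>     /** Stand-in for the real LDAP connection/session API. */
>     public interface LdapOps {
>         void add(String dn);
>         void searchSubtree(String baseDn, String filter);
>         void delete(String dn);
>     }
> 
>     public static void run(LdapOps ldap) throws InterruptedException {
>         int threads = 20;
>         int loops = 10_000;
>         ExecutorService pool = Executors.newFixedThreadPool(threads);
> 
>         for (int n = 0; n < threads; n++) {
>             final int threadNumber = n;
>             pool.execute(() -> {
>                 String dn = "cn=test" + threadNumber + ",ou=system"; // example base DN
>                 for (int i = 0; i < loops; i++) {
>                     ldap.add(dn);                                   // add cn=test<N>
>                     ldap.searchSubtree("ou=system", "(cn=test*)");  // search subtree cn=test*
>                     ldap.delete(dn);                                // delete cn=test<N>
>                 }
>             });
>         }
> 
>         pool.shutdown();
>         pool.awaitTermination(1, TimeUnit.HOURS);
>     }
> }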
> 
> This test was failing badly on trunk (the more threads we have, the more errors we get), and it now runs just fine.
> 
> (The test was failing because, while doing the search, we were fetching entries from the backend while some deletions were being done at the same time.)
> 
> One more thing: performance. The branch is actually faster than the current trunk, as measured with the SearchPerfIT test.
> 
> Here are the numbers I get when using the embedded LDAP server:
> 
> 1.5.0 (FTR) :
> - Object   Search ->  32 993/s
> - OneLevel Search -> 156 935/s
> - SubLevel Search -> 137 225/s
> 
> Trunk :
> - Object   Search ->  57 191/s
> - OneLevel Search ->  76 250/s
> - SubLevel Search ->  77 200/s
> 
> MVBT branch :
> - Object   Search ->  76 540/s (+33%/trunk)
> - OneLevel Search -> 109 985/s (+44%/trunk)
> - SubLevel Search -> 134 750/s (+74%/trunk)
> 
> We get back one entry for the Object search, 5 for the OneLevel search and 10 for the SubLevel search. What we measure is the number of entries we get back per second.
> 
> The 1.5.0 version is actually faster for OneLevel and SubLevel searches because we were storing the DN in the entry (rebuilding the DN is a costly operation).
> 
> The gain is even higher when we use a complex filter, due to the numerous spurious fetches of the same entry in trunk.
> 
> The same tests, but over the network:
> MVBT branch :
> Object :  2 564/s
> One    : 17 960/s
> Sub    : 23 050/s
> 
> Trunk :
> Object :  2 567/s
> One    : 12 555/s
> Sub    : 21 240/s
> 
> Here, the gain is less significant, mainly because we have the client and the server on the same machine, and the message encoding/decoding overshadows the gain in the backend. But we still see a gap for the OneLevel and SubLevel searches (between 8% and 43%).
> 
> At this point, I think we have a working server, solving a critical issue we have been fighting with for more than one year now, and it's actually faster than the current trunk. It also solves some other issues.
> 
> Before continuing with the Mavibot backend implementation (which will bring some major performance boosts, AFAICT), I would suggest we promote the current mvbt branch as the trunk and cut a 2.0.0-M8 release ASAP. I do think that it's a better, more stable, and faster working base for the upcoming versions and, more importantly, for the 2.0 GA.
> 
> We should also continue the ongoing work on the other solutions, as they offer some extra advantages, but we may have them in a 2.1 or a later version.
> 
> So, wdyt?
> 
> -- 
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
> 

