Subject: Re: Current status of the server, and mvbt-branch status
From: Pierre-Arnaud Marcelot
To: "Apache Directory Developers List" <dev@directory.apache.org>, elecharny@apache.org
Date: Fri, 21 Sep 2012 12:05:16 +0200
In-Reply-To: <5059D7E7.7080705@gmail.com>

Very interesting improvements. Love the increase in performance and the simplification of the class hierarchy.

+1 for the merge.

Regards,
Pierre-Arnaud

On 19 Sep 2012, at 16:34, Emmanuel Lécharny wrote:

> Last year, in September, I detected a problem when conducting concurrent searches and modifications [1]. This problem is a real show-stopper.
>
> The problem is that, as we fetch the entries one by one using indexes, if a modification occurs and impacts the index we are using while the search is being processed, we may not be able to get back the values the search is pointing to. This results in an NPE, at best.
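>
> To make the failure mode concrete, here is a minimal, self-contained sketch of that pattern (illustrative only, with made-up names, not the actual ApacheDS code): one thread walks an index and dereferences the entry ids it finds against the master table, while another thread deletes entries at the same time. The fetch can come back null even though the index still pointed at the entry:
>
>     import java.util.UUID;
>     import java.util.concurrent.ConcurrentSkipListMap;
>
>     public class LostEntryRace {
>         // Stand-ins for an index (key -> entry id) and the master table (id -> entry).
>         static final ConcurrentSkipListMap<String, UUID> index = new ConcurrentSkipListMap<>();
>         static final ConcurrentSkipListMap<UUID, String> master = new ConcurrentSkipListMap<>();
>
>         public static void main(String[] args) throws InterruptedException {
>             for (int i = 0; i < 100_000; i++) {
>                 UUID id = UUID.randomUUID();
>                 index.put("cn=test" + i, id);
>                 master.put(id, "entry-" + i);
>             }
>
>             // Writer: concurrent modification, removing entries while the search runs.
>             Thread writer = new Thread(() -> {
>                 for (String key : index.keySet()) {
>                     UUID id = index.remove(key);
>                     if (id != null) {
>                         master.remove(id);
>                     }
>                 }
>             });
>             writer.start();
>
>             // "Search": walk the index, then fetch each candidate from the master table.
>             for (UUID id : index.values()) {
>                 String entry = master.get(id);
>                 if (entry == null) {
>                     // The index still referenced the entry, but it is already gone from
>                     // the master table: dereferencing it blindly is the NPE described above.
>                     System.out.println("dangling index reference: " + id);
>                 }
>             }
>             writer.join();
>         }
>     }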
>
> We looked for ways to solve this issue, and a few of them are available:
>
> 1) The first solution would be to implement an in-memory MVCC system, with a write-ahead log. The idea is to keep in memory all the modified values until all the pending searches are completed; they can then be removed from memory. This solution would be cross-partition.
>
> 2) The second solution would be to use MVCC backends. As we don't remove any element until it is no longer used, we are guaranteed to always be able to fetch an element from the backend, even if it has been modified; we will just get back an old revision.
>
> 3) A third solution, which requires neither a global MVCC system nor an MVCC backend, would be to serialize writes and reads. Reads would be concurrent, and writes could only be done once all the pending reads and writes have completed. Of course, if a read is received while a write is being processed, this read will have to wait for the write to complete. In order to keep performance good, we need to speed up the reads so that writes are not delayed forever: this is done by computing the set of candidates immediately, and fetching the entries one by one later, as sketched below. If an entry has been removed by the time we fetch it, it is simply ignored.
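>
> A rough sketch of that two-phase search (hypothetical names, not the code in the branch): phase 1 freezes the candidate UUIDs by walking the indexes, so the search no longer depends on them and writes can proceed; phase 2 fetches the entries lazily, silently skipping anything deleted in the meantime:
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.Map;
>     import java.util.UUID;
>     import java.util.function.Predicate;
>
>     public class CandidateSetSearch {
>         // Phase 1: evaluate the filter against an index and freeze the candidate set.
>         static List<UUID> collectCandidates(Map<String, UUID> index, Predicate<String> filter) {
>             List<UUID> candidates = new ArrayList<>();
>             for (Map.Entry<String, UUID> e : index.entrySet()) {
>                 if (filter.test(e.getKey())) {
>                     candidates.add(e.getValue());
>                 }
>             }
>             return candidates;
>         }
>
>         // Phase 2: fetch the entries one by one from the master table. An entry
>         // deleted since phase 1 comes back null and is simply ignored.
>         static List<String> fetchEntries(Map<UUID, String> master, List<UUID> candidates) {
>             List<String> results = new ArrayList<>();
>             for (UUID id : candidates) {
>                 String entry = master.get(id);
>                 if (entry != null) {
>                     results.add(entry);
>                 }
>             }
>             return results;
>         }
>     }
>
> A side effect worth noting: since the candidate list is a plain in-memory list, a cursor over it can trivially move both forward and backward.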
>
> Those three solutions have pros and cons. We will analyze them in a later paragraph.
>
> I wanted to play around with the 2nd idea, and as I needed an in-memory MVCC BTree for that, I wrote a prototype BTree for that purpose (Mavibot [2]).
>
> While I thought Mavibot was more or less ready, I created a new branch (apacheds-mvbt) to experiment with the new backend. I had to do some cleanup in order to be able to use the Mavibot implementation:
>
> o Use UUID instead of Long for the Entry identifiers
> o Simplify the cursor implementation and the use of generics [3]
>
> After a few days, I found that some of the technical choices we had made were problematic, and I went a bit further:
>
> 1) The Cursor interface supposes you can move forward and backward. Sadly, when a search uses a filter with an 'OR' connector, there is no way to move backward. Not that it's really a problem in standard LDAP usage, but it would defeat the VLV control.
>
> 2) Not fetching entries immediately can save time when the entry is not a valid candidate against the filter we use. However, using indexes to check whether an entry is valid means going down a BTree, which is also a costly operation. If we have a filter with N indexed attributes, we go down N BTrees for each candidate, instead of fetching the entry once and simply validating it against the filter using comparators.
>
> 3) We were using generics extensively, in a way that makes it extremely confusing to know what is what (having 3 type parameters for an IndexCursor does not help at all).
>
> 4) We were using a reverse table for every index, even when it was useless.
>
> 5) The index initialization was problematic.
>
> 6) Once we use UUID instead of Long, there is no reason to use the entryUUID index anymore; we can do a full scan using the MasterTable directly.
>
> 7) The Cursor hierarchy was not fully consistent.
>
> 8) Entries were fetched more than once for complex filters, badly slowing down the processing of such searches.
>
> 9) Some evaluators could be removed, sparing some CPU, when the number of candidates they selected was 0.
>
> 10) Some Or and And cursors could be removed when they had only one candidate.
>
> All those items have been fixed in the mvbt branch.
>
> Then, before implementing Mavibot, I found that we could implement the 3rd strategy easily. The idea is to allow concurrent searches, while writes are serialized.
>
> The problem was that, with the way we were using cursors, we might browse a BTree for a very long time - especially when processing a PagedSearch - and we can't expect the writes to wait until such a search is completed. The alternative was to gather all the candidates *before* fetching any entry, and to store their associated UUIDs in a list of candidates.
>
> This is no different from how we already conduct searches with an OR in the filter: there too, all the candidate UUIDs are gathered before the end of the search.
>
> Doing so offers many advantages:
> - we can exit the search faster, with all the candidates already selected (even for a paged search), so the writes can be applied more frequently
> - we speed up searches where more than one filter term can be used to select the valid candidates
> - once the set has been created, we can move forward *and* backward, something we can't do with the current implementation when we use an OR/AND connector in the filter
>
> There are some cons, though:
> - gathering the candidate UUIDs may eat memory
> - in some cases, it may be slower than the current implementation, simply because the number of selected candidates may be way higher than the number of entries we will return (we pay the cost of the extra entry fetches in this case)
>
> But even then, we already have to keep a set of candidate UUIDs when doing a search with an OR/AND connector in the current implementation, and it's not proven that fetching an entry is more costly than fetching a value from a given index - not to mention that the math may change a lot if we have more than one Node in the filter.
>
> Now, in order to fix the concurrency issue we found last year, we have to add a couple of ReadWriteLocks (a sketch follows the test description below):
>
> o one in the OperationManager, to forbid writes from executing while some searches are holding the read lock, and to forbid a search from executing while a write holds the write lock
> o another one in AbstractBtreePartition, to protect the master table from concurrent access, as the searches fetch the entries directly from the MasterTable - and the RdnIndex - even while some writes are updating them.
>
> The current branch implements this algorithm. It works, and it's fast. I have re-run the concurrent modifications and searches that demonstrated the issue last year: I was able to run 10,000 loops over 20 threads, each one doing:
> - add cn=testN
> - search subtree cn=testN*
> - delete cn=testN
> where N is the thread number.
>
> This test was failing badly on trunk (the more threads we had, the more errors we got), and is now running just fine.
>
> (The test was failing because, while doing the search, we were fetching entries from the backend at the same time as deletions were being done.)
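>
> The OperationManager side of that locking scheme boils down to something like the following (a minimal sketch with hypothetical names; the classes in the branch are organized differently):
>
>     import java.util.concurrent.locks.ReadWriteLock;
>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>     import java.util.function.Supplier;
>
>     public class OperationSerializer {
>         // Fair mode, so a pending write is not starved by a steady stream of reads.
>         private final ReadWriteLock lock = new ReentrantReadWriteLock(true);
>
>         // Searches run concurrently: they only take the read lock.
>         public <T> T read(Supplier<T> search) {
>             lock.readLock().lock();
>             try {
>                 return search.get();
>             } finally {
>                 lock.readLock().unlock();
>             }
>         }
>
>         // Writes are serialized, and wait for all the pending reads to drain first.
>         public void write(Runnable modification) {
>             lock.writeLock().lock();
>             try {
>                 modification.run();
>             } finally {
>                 lock.writeLock().unlock();
>             }
>         }
>     }
>
> Combined with the up-front candidate gathering, a search holds the read lock only briefly, so even a long-lived paged search does not delay the writes for long.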
>
> One more thing: performance. The branch is actually faster than the current trunk. The searches were conducted using the SearchPerfIT test.
>
> Here are the numbers I get when using the embedded LDAP server:
>
> 1.5.0 (FTR):
> - Object Search -> 32 993/s
> - OneLevel Search -> 156 935/s
> - SubLevel Search -> 137 225/s
>
> Trunk:
> - Object Search -> 57 191/s
> - OneLevel Search -> 76 250/s
> - SubLevel Search -> 77 200/s
>
> MVBT branch:
> - Object Search -> 76 540/s (+33% vs trunk)
> - OneLevel Search -> 109 985/s (+44% vs trunk)
> - SubLevel Search -> 134 750/s (+74% vs trunk)
>
> We get back one entry for the Object search, 5 for the OneLevel search and 10 for the SubLevel search. What we measure is the number of entries returned per second.
>
> The 1.5.0 version is actually faster for OneLevel and SubLevel searches because it stored the DN in the entry - rebuilding the DN is a costly operation.
>
> The gain is even higher when we use a complex filter, due to the numerous spurious fetches of the same entry in trunk.
>
> The same tests, but over the network:
>
> MVBT branch:
> - Object : 2 564/s
> - One : 17 960/s
> - Sub : 23 050/s
>
> Trunk:
> - Object : 2 567/s
> - One : 12 555/s
> - Sub : 21 240/s
>
> Here, the gain is less significant, mainly because the client and the server are on the same machine, so the message encoding/decoding masks the gain in the backend. But we still see a gap in the OneLevel and SubLevel searches (between 8% and 43%).
>
> At this point, I think we have a working server that solves a critical issue we have been fighting for more than a year now, and it's actually faster than the current trunk. It also solves some other issues.
>
> Before continuing with the Mavibot backend implementation - which will bring some major performance boosts, AFAICT - I would suggest we promote the current mvbt branch to trunk, and cut a 2.0.0-M8 release asap. I do think it's a better, more stable, and faster working base for the upcoming versions and, more importantly, for the 2.0 GA.
>
> We should also continue the ongoing work on the other solutions, as they offer some extra advantages, but we may ship them in 2.1 or a later version.
>
> So, wdyt?
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com