Subject: Re: Current status of the server, and mvbt-branch status
From: Pierre-Arnaud Marcelot
To: "Apache Directory Developers List" <dev@directory.apache.org>, elecharny@apache.org
Date: Fri, 21 Sep 2012 12:05:16 +0200
In-Reply-To: <5059D7E7.7080705@gmail.com>

Very interesting improvements. Love the increase in performance and the simplification of the class hierarchy.

+1 for the merge.

Regards,
Pierre-Arnaud

On 19 Sep 2012, at 16:34, Emmanuel Lécharny wrote:

> Last year, in September, I detected a problem when conducting concurrent searches and modifications [1]. This problem is a real show-stopper.
>
> The problem is that, as we fetch the entries one by one using indexes, if a modification occurs and impacts the index we are using while the search is being processed, we may not be able to get back the values the search is pointing to. This results in an NPE, at best.
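>
> To make the failure mode concrete, here is a minimal, self-contained sketch of that pattern (illustrative only, with made-up names, not the actual ApacheDS code): one thread walks an index and dereferences the entry ids it finds against the master table, while another thread deletes entries at the same time. The fetch can come back null even though the index still pointed at the entry:
>
>     import java.util.UUID;
>     import java.util.concurrent.ConcurrentSkipListMap;
>
>     public class LostEntryRace {
>         // Stand-ins for an index (key -> entry id) and the master table (id -> entry).
>         static final ConcurrentSkipListMap<String, UUID> index = new ConcurrentSkipListMap<>();
>         static final ConcurrentSkipListMap<UUID, String> master = new ConcurrentSkipListMap<>();
>
>         public static void main(String[] args) throws InterruptedException {
>             for (int i = 0; i < 100_000; i++) {
>                 UUID id = UUID.randomUUID();
>                 index.put("cn=test" + i, id);
>                 master.put(id, "entry-" + i);
>             }
>
>             // Writer: concurrent modification, removing entries while the search runs.
>             Thread writer = new Thread(() -> {
>                 for (String key : index.keySet()) {
>                     UUID id = index.remove(key);
>                     if (id != null) {
>                         master.remove(id);
>                     }
>                 }
>             });
>             writer.start();
>
>             // "Search": walk the index, then fetch each candidate from the master table.
>             for (UUID id : index.values()) {
>                 String entry = master.get(id);
>                 if (entry == null) {
>                     // The index still referenced the entry, but it is already gone from
>                     // the master table: dereferencing it blindly is the NPE described above.
>                     System.out.println("dangling index reference: " + id);
>                 }
>             }
>             writer.join();
>         }
>     }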
>
> We looked for ways to solve this issue, and a few of them are available:
>
> 1) The first solution would be to implement an in-memory MVCC system, with a write-ahead log. The idea is to keep in memory all the modified values until all the pending searches are completed; they can then be removed from memory. This solution would be cross-partition.
>
> 2) The second solution would be to use MVCC backends. As we don't remove any element until it is no longer used, we are guaranteed to always be able to fetch an element from the backend, even if it has been modified; we will just get back an old revision.
>
> 3) A third solution, which requires neither a global MVCC system nor an MVCC backend, would be to serialize writes and reads. Reads would be concurrent, and writes could only be done once all the pending reads and writes have completed. Of course, if a read is received while a write is being processed, this read will have to wait for the write to complete. In order to keep performance good, we need to speed up the reads so that writes are not delayed forever: this is done by computing the set of candidates immediately, and fetching the entries one by one later, as sketched below. If an entry has been removed by the time we fetch it, it is simply ignored.
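>
> A rough sketch of that two-phase search (hypothetical names, not the code in the branch): phase 1 freezes the candidate UUIDs by walking the indexes, so the search no longer depends on them and writes can proceed; phase 2 fetches the entries lazily, silently skipping anything deleted in the meantime:
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.Map;
>     import java.util.UUID;
>     import java.util.function.Predicate;
>
>     public class CandidateSetSearch {
>         // Phase 1: evaluate the filter against an index and freeze the candidate set.
>         static List<UUID> collectCandidates(Map<String, UUID> index, Predicate<String> filter) {
>             List<UUID> candidates = new ArrayList<>();
>             for (Map.Entry<String, UUID> e : index.entrySet()) {
>                 if (filter.test(e.getKey())) {
>                     candidates.add(e.getValue());
>                 }
>             }
>             return candidates;
>         }
>
>         // Phase 2: fetch the entries one by one from the master table. An entry
>         // deleted since phase 1 comes back null and is simply ignored.
>         static List<String> fetchEntries(Map<UUID, String> master, List<UUID> candidates) {
>             List<String> results = new ArrayList<>();
>             for (UUID id : candidates) {
>                 String entry = master.get(id);
>                 if (entry != null) {
>                     results.add(entry);
>                 }
>             }
>             return results;
>         }
>     }
>
> A side effect worth noting: since the candidate list is a plain in-memory list, a cursor over it can trivially move both forward and backward.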
>
> Those three solutions have pros and cons. We will analyze them in a later paragraph.
>
> I wanted to play around with the 2nd idea, and as I needed an in-memory MVCC BTree for that, I wrote a prototype BTree for that purpose (Mavibot [2]).
>
> While I thought Mavibot was more or less ready, I created a new branch (apacheds-mvbt) to experiment with the new backend. I had to do some cleanup in order to be able to use the Mavibot implementation:
>
> o Use UUID instead of Long for the Entry identifiers
> o Simplify the cursor implementation and the use of generics [3]
>
> After a few days, I found that some of the technical choices we had made were problematic, and I went a bit further:
>
> 1) The Cursor interface supposes you can move forward and backward. Sadly, when a search uses a filter with an 'OR' connector, there is no way to move backward. Not that it's really a problem in standard LDAP usage, but it would defeat the VLV control.
>
> 2) Not fetching entries immediately can save time when the entry is not a valid candidate against the filter we use. However, using indexes to check whether an entry is valid means going down a BTree, which is also a costly operation. If we have a filter with N indexed attributes, we go down N BTrees for each candidate, instead of fetching the entry once and simply validating it against the filter using comparators.
>
> 3) We were using generics extensively, in a way that makes it extremely confusing to know what is what (having 3 type parameters for an IndexCursor does not help at all).
>
> 4) We were using a reverse table for every index, even when it was useless.
>
> 5) The index initialization was problematic.
>
> 6) Once we use UUID instead of Long, there is no reason to use the entryUUID index anymore; we can do a full scan using the MasterTable directly.
>
> 7) The Cursor hierarchy was not fully consistent.
>
> 8) Entries were fetched more than once for complex filters, badly slowing down the processing of such searches.
>
> 9) Some evaluators could be removed, sparing some CPU, when the number of candidates they selected was 0.
>
> 10) Some Or and And cursors could be removed when they had only one candidate.
>
> All those items have been fixed in the mvbt branch.
>
> Then, before implementing Mavibot, I found that we could implement the 3rd strategy easily. The idea is to allow concurrent searches, while writes are serialized.
>
> The problem was that, with the way we were using cursors, we might browse a BTree for a very long time - especially when processing a PagedSearch - and we can't expect the writes to wait until such a search is completed. The alternative was to gather all the candidates *before* fetching any entry, and to store their associated UUIDs in a list of candidates.
>
> This is no different from how we already conduct searches with an OR in the filter: there too, all the candidate UUIDs are gathered before the end of the search.
>
> Doing so offers many advantages:
> - we can exit the search faster, with all the candidates already selected (even for a paged search), so the writes can be applied more frequently
> - we speed up searches where more than one filter term can be used to select the valid candidates
> - once the set has been created, we can move forward *and* backward, something we can't do with the current implementation when we use an OR/AND connector in the filter
>
> There are some cons, though:
> - gathering the candidate UUIDs may eat memory
> - in some cases, it may be slower than the current implementation, simply because the number of selected candidates may be way higher than the number of entries we will return (we pay the cost of the extra entry fetches in this case)
>
> But even then, we already have to keep a set of candidate UUIDs when doing a search with an OR/AND connector in the current implementation, and it's not proven that fetching an entry is more costly than fetching a value from a given index - not to mention that the math may change a lot if we have more than one Node in the filter.
>
> Now, in order to fix the concurrency issue we found last year, we have to add a couple of ReadWriteLocks (a sketch follows the test description below):
>
> o one in the OperationManager, to forbid writes from executing while some searches are holding the read lock, and to forbid a search from executing while a write holds the write lock
> o another one in AbstractBtreePartition, to protect the master table from concurrent access, as the searches fetch the entries directly from the MasterTable - and the RdnIndex - even while some writes are updating them.
>
> The current branch implements this algorithm. It works, and it's fast. I have re-run the concurrent modifications and searches that demonstrated the issue last year: I was able to run 10,000 loops over 20 threads, each one doing:
> - add cn=testN
> - search subtree cn=testN*
> - delete cn=testN
> where N is the thread number.
>
> This test was failing badly on trunk (the more threads we had, the more errors we got), and is now running just fine.
>
> (The test was failing because, while doing the search, we were fetching entries from the backend at the same time as deletions were being done.)
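>
> The OperationManager side of that locking scheme boils down to something like the following (a minimal sketch with hypothetical names; the classes in the branch are organized differently):
>
>     import java.util.concurrent.locks.ReadWriteLock;
>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>     import java.util.function.Supplier;
>
>     public class OperationSerializer {
>         // Fair mode, so a pending write is not starved by a steady stream of reads.
>         private final ReadWriteLock lock = new ReentrantReadWriteLock(true);
>
>         // Searches run concurrently: they only take the read lock.
>         public <T> T read(Supplier<T> search) {
>             lock.readLock().lock();
>             try {
>                 return search.get();
>             } finally {
>                 lock.readLock().unlock();
>             }
>         }
>
>         // Writes are serialized, and wait for all the pending reads to drain first.
>         public void write(Runnable modification) {
>             lock.writeLock().lock();
>             try {
>                 modification.run();
>             } finally {
>                 lock.writeLock().unlock();
>             }
>         }
>     }
>
> Combined with the up-front candidate gathering, a search holds the read lock only briefly, so even a long-lived paged search does not delay the writes for long.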
>
> One more thing: performance. The branch is actually faster than the current trunk. The searches were conducted using the SearchPerfIT test.
>
> Here are the numbers I get when using the embedded LDAP server:
>
> 1.5.0 (FTR):
> - Object Search -> 32 993/s
> - OneLevel Search -> 156 935/s
> - SubLevel Search -> 137 225/s
>
> Trunk:
> - Object Search -> 57 191/s
> - OneLevel Search -> 76 250/s
> - SubLevel Search -> 77 200/s
>
> MVBT branch:
> - Object Search -> 76 540/s (+33% vs trunk)
> - OneLevel Search -> 109 985/s (+44% vs trunk)
> - SubLevel Search -> 134 750/s (+74% vs trunk)
>
> We get back one entry for the Object search, 5 for the OneLevel search and 10 for the SubLevel search. What we measure is the number of entries returned per second.
>
> The 1.5.0 version is actually faster for OneLevel and SubLevel searches because it stored the DN in the entry - rebuilding the DN is a costly operation.
>
> The gain is even higher when we use a complex filter, due to the numerous spurious fetches of the same entry in trunk.
>
> The same tests, but over the network:
>
> MVBT branch:
> - Object : 2 564/s
> - One : 17 960/s
> - Sub : 23 050/s
>
> Trunk:
> - Object : 2 567/s
> - One : 12 555/s
> - Sub : 21 240/s
>
> Here, the gain is less significant, mainly because the client and the server are on the same machine, so the message encoding/decoding masks the gain in the backend. But we still see a gap in the OneLevel and SubLevel searches (between 8% and 43%).
>
> At this point, I think we have a working server that solves a critical issue we have been fighting for more than a year now, and it's actually faster than the current trunk. It also solves some other issues.
>
> Before continuing with the Mavibot backend implementation - which will bring some major performance boosts, AFAICT - I would suggest we promote the current mvbt branch to trunk, and cut a 2.0.0-M8 release asap. I do think it's a better, more stable, and faster working base for the upcoming versions and, more importantly, for the 2.0 GA.
>
> We should also continue the ongoing work on the other solutions, as they offer some extra advantages, but we may ship them in 2.1 or a later version.
>
> So, wdyt?
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com