From: Selcuk AYA <ayaselcuk@gmail.com>
To: Apache Directory Developers List <dev@directory.apache.org>, elecharny@apache.org
Date: Mon, 29 Aug 2011 11:27:36 +0300
Subject: Status update on jdbm work (was Re: JDBM experimental branch created)

Hi,

I just attached my latest changes for the jdbm branch and wanted to give a status update and some technical details:

1) Summary

* We now have a jdbm tree which treats find, insert, remove and browse as actions that execute in isolation. In particular, read actions are not affected by ongoing structural changes to the tree and only see data changes that completed before they started.

* We allow one writer and multiple readers to execute concurrently. Synchronized operations are mostly removed.

* All existing tests except StoredProcedureIT pass, as do the unit tests I added for the versioned tree (I ran mvn clean install -Dintegration). I think the StoredProcedureIT failure is a pre-existing issue. There is a piece of code where I serialize and deserialize tuple values stored in the JDBM btree in order to do a deep copy; with StoredProcedureIT, deserializing the hello world stored procedure throws a UTFDataFormatException. On a clean branch, I added similar code to deserialize B+ tree page values right after they are serialized, and I hit the same issue. So I think this is an existing problem with stored procedure serialization/deserialization.

2) Changes above the JDBM level

* I added changes to call the newly added browser->close() interface when cursors are closed or a cursor is repositioned.

* I hit some existing issues where cursors are not closed. In particular, I had to change SubentryInterceptor.java to close the cursor after search operations, and change the JDBM container cursor to close the contained cursor when it itself is closed.
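The cursor fixes boil down to two patterns: close a cursor once its results have been consumed, and have a wrapping cursor forward close() to the cursor it contains. Below is a minimal standalone sketch of the second pattern with hypothetical types; these names are illustrative stand-ins, not the real ApacheDS/JDBM cursor API:

```java
import java.util.List;

interface Cursor<E> {
    boolean next();
    E get();
    void close();
}

/** A simple cursor over a list, tracking whether it has been closed. */
class ListCursor<E> implements Cursor<E> {
    private final List<E> data;
    private int pos = -1;
    private boolean closed;

    ListCursor(List<E> data) { this.data = data; }
    public boolean next() { return ++pos < data.size(); }
    public E get() { return data.get(pos); }
    public void close() { closed = true; }
    boolean isClosed() { return closed; }
}

/** A wrapping cursor that propagates close() to the cursor it contains. */
class ContainerCursor<E> implements Cursor<E> {
    private final Cursor<E> contained;

    ContainerCursor(Cursor<E> contained) { this.contained = contained; }
    public boolean next() { return contained.next(); }
    public E get() { return contained.get(); }
    // The essential point of the fix: closing the container must also
    // close the contained cursor, otherwise the inner one leaks.
    public void close() { contained.close(); }
}
```

On the caller side the first pattern is the usual try/finally: iterate inside try, and close the cursor in finally so it is released even when iteration throws.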
If required, I can provide these changes as separate fixes.

3) Technical details at the JDBM level

* The core functionality is in LRUCache.java, which implements a concurrent, versioned cache. There is a power-of-two number of hash buckets, with a lock for every 8 buckets (lock striping). The number of hash buckets is the closest power of two x such that x < the maximum number of cache entries.

* The cache replacement policy is LRU. There are 16 LRU lists, each cache entry is assigned to one of them, and each list is protected by a separate lock. LRU replacement is supposed to be fast: threads pick a list through an LRU randomizer, and since each thread randomly chooses a list to replace from, LRU operations should not become a bottleneck.

* Each cache entry has a [startVersion, endVersion) interval in which it is valid. At any time, a hash bucket chain looks like this:

  (key1, startVersion11, endVersion11) <-> (key2, startVersion21, endVersion21) <-> ...
                 |                                        |
  (key1, startVersion12, endVersion12)   (key2, startVersion22, endVersion22)
                 |                                        |
                ...                                      ...

That is, there is a primary chain linking entries for different keys, and under each of them a subchain holding the different versions of that key. When readers search for a (key, version) pair, they first walk the primary chain and then walk the subchain to find their entry.

* As writes create previous versions of entries, they use part of the cache to store them. The rule is that such an entry cannot be replaced as long as there might be a reader that could still read it; we keep track of the minimum read action version to decide when such entries become replaceable.

* If there are long browse operations and quite a few updates going on at the same time, we might run into a case where most of the cache entries are used to store previous versions.
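The bucket chains described above can be sketched roughly as follows. This is a simplified, single-threaded illustration with assumed names (the real code is LRUCache.java in the branch, which also handles lock striping, LRU bookkeeping and latching); it only shows how a writer demotes the current version and how a reader walks the primary chain by key and then the version subchain by [startVersion, endVersion):

```java
/** One cache entry, valid over [startVersion, endVersion). */
class VersionedEntry<K, V> {
    final K key;
    final long startVersion;
    long endVersion;                     // Long.MAX_VALUE while still current
    final V value;
    VersionedEntry<K, V> olderVersion;   // the version subchain
    VersionedEntry<K, V> nextKey;        // the primary chain

    VersionedEntry(K key, long start, long end, V value) {
        this.key = key;
        this.startVersion = start;
        this.endVersion = end;
        this.value = value;
    }
}

/** One hash bucket: a primary chain of keys, each with a subchain of versions. */
class VersionedBucket<K, V> {
    private VersionedEntry<K, V> head;

    /** A write at 'version': insert a new current version, demoting the old one. */
    void put(K key, V value, long version) {
        VersionedEntry<K, V> e = new VersionedEntry<>(key, version, Long.MAX_VALUE, value);
        VersionedEntry<K, V> cur = head, prev = null;
        while (cur != null && !cur.key.equals(key)) { prev = cur; cur = cur.nextKey; }
        if (cur != null) {
            cur.endVersion = version;    // the old current version ends here
            e.olderVersion = cur;        // and moves down into the subchain
            e.nextKey = cur.nextKey;
            if (prev == null) head = e; else prev.nextKey = e;
        } else {
            e.nextKey = head;            // first version of a new key
            head = e;
        }
    }

    /** A reader at 'version': walk the primary chain, then the version subchain. */
    V get(K key, long version) {
        VersionedEntry<K, V> cur = head;
        while (cur != null && !cur.key.equals(key)) cur = cur.nextKey;
        for (; cur != null; cur = cur.olderVersion) {
            if (version >= cur.startVersion && version < cur.endVersion) {
                return cur.value;
            }
        }
        return null;                     // no version visible at this read version
    }
}
```

In this shape a reader pinned at an old version keeps finding its old entry on the subchain while a writer has already installed a newer current version at the head, which is exactly the isolation property described in the summary.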
We might even have a case where all entries store previous versions and none of them can be replaced (because of the rule above). In that case, writers wait for a freeable cache entry. When a reader cannot find a replaceable entry, it reads from disk while holding the bucket latch (thus holding off any writer on the same location) and returns the entry to the user without populating the cache, and therefore without looking for a replaceable cache entry. Since readers always make progress, the minimum read version will eventually advance and writers will make progress too. Normally, when readers or writers do IO, they release the hash latch.

* There are some helper classes the LRUCache needs in order to work. Maybe the most interesting ones are ActionVersioning, which uses AtomicInteger and AtomicReference and is optimized for the read-mostly case, and ExplicitList, where remove operations are O(1) given an element (in contrast to the O(n) remove on Java's lists when given a reference to an element). Such fast removes are needed for the LRU algorithm.

* When (key, value) pairs are added to the Btree or retrieved from it, the Btree does a deep copy of the value (through serialization and deserialization). This is needed so that the Btree can store previous versions of values. I assumed that keys stored in Btrees are not changed; if they were, even the CacheRecordManager currently in use would not work.

4) Possible improvements

* If most of the cache entries are used to store previous versions, cache effectiveness will decrease. A solution is to start spilling previous versions to disk when this happens; the subchain we talked about above would have to be spilled. However, this is only a performance problem, and a corner case at that if it is true that LDAP is mostly read.

* Currently, when a write action is executing and there is an IO exception, the action is aborted and I do not advance the read version, so readers do not see the effects of the aborted action.
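The O(1) remove that ExplicitList provides is the classic intrusive doubly linked list: because each element carries its own prev/next links, unlinking needs no scan, whereas java.util.LinkedList.remove(Object) must first search the list in O(n). A hedged sketch of the idea follows; the names are illustrative, not the actual ExplicitList API:

```java
/** A list node; in an intrusive list the element itself carries the links. */
class Node {
    Node prev, next;
}

/** Doubly linked list with a sentinel head, supporting O(1) remove(node). */
class IntrusiveList {
    private final Node head = new Node();   // sentinel: head.next is the first element

    IntrusiveList() {
        head.prev = head;
        head.next = head;                   // empty list: sentinel links to itself
    }

    void addFirst(Node n) {
        n.prev = head;
        n.next = head.next;
        head.next.prev = n;
        head.next = n;
    }

    /** O(1): the node already knows its neighbours, so no scan is needed. */
    void remove(Node n) {
        n.prev.next = n.next;
        n.next.prev = n.prev;
        n.prev = n;                         // self-link so a repeated remove is harmless
        n.next = n;
    }

    boolean isEmpty() { return head.next == head; }

    int size() {
        int count = 0;
        for (Node cur = head.next; cur != head; cur = cur.next) count++;
        return count;
    }
}
```

The sentinel avoids special-casing an empty list, which keeps remove() branch-free; that matters when every LRU touch does an unlink/relink under a lock.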
However, it seems that upper layers do not do enough cleanup in this case: they continue using the jdbm stores, and this will lead to inconsistency. A good improvement would be to roll back all the dirty changes. Also, jdbm transactions are not currently enabled, so a crash in the middle of syncing might leave the store inconsistent.

5) TODO

* Add some more test cases for the versioned btree to cover corner cases.

* I am not very keen to implement disk spilling, since this is only a performance improvement needed in corner cases if stores are mostly read-only. But if you guys think this is really necessary, I can look into it as well.

regards,
Selcuk Aya

On Thu, Aug 25, 2011 at 12:38 AM, Emmanuel Lecharny wrote:
> On 8/24/11 10:03 PM, Selcuk AYA wrote:
>>
>> Hi,
>> no problem. Please let me know the JIRA issue number and I can provide
>> you with a patch containing a basic working version of the versioned
>> Btree.
>
> probably :
>
> https://issues.apache.org/jira/browse/DIRSERVER-1642
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com