directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Karasulu <>
Subject Re: status update on jdbm work(was Re: JDBM experimental branch created)
Date Mon, 29 Aug 2011 10:19:49 GMT
Oooops sorry just saw this thread. I responded to the other thread. Sorry
about this. Will switch to this one.

On Mon, Aug 29, 2011 at 12:55 PM, Emmanuel Lecharny <>wrote:

> Hi Selcuk,
> On 8/29/11 10:52 AM, Selcuk AYA wrote:
>> Hi I just attached my latest changes for the jdbm branch and wanted to
>> give a status update and some technical details:
> Many thanks !
>> 1) Summary
>> *We now have a jdbm tree which treats find, insert, remove and browse
>> as actions that execute in isolation. In particular, read actions will
>> not be affected by ongoing structural changes to the tree and will
>> only see data changes that completed before they started.
> Just great !
>> *We allow one writer and multiple readers to execute concurrently.
>> Synchronized operations are mostly removed.
> Woot ! We need to do some perf tests to see what kind of performances
> improvement it brings...
>> * Exisiting tests except the StoredProceduteIT and the unit tests I
>> added for the versioned tree pass( I did mvn clean install
>> -Dintegration). I think the problem with StoredProceduteIT is an
>> existing one. There is a code piece where I serialize and deserialize
>> tuple values stored in JDBM btree in order to do a deep copy. With
>> StoredProcedureIT hello world stored procedure deserialization throws
>> a UTFDataFormatException. On a clean brach, I added similar code to
>> deserialize B+ tree page values right after they are serialized, and I
>> hit the same issue. So I think this is some existing issue with stored
>> procedure serialization/deserialization.
> I gonna review the ignored test.
>> 2) Changes above JDBM level
>> * I added changes to call the newly added browser->close() interface
>> when the cursors are closed  or a cursor is repositioned.
>> * I hit some existing issues where cursors are not closed. In
>> particular, I had to change to close the
>> cursor after search operations and change the JDBM container cursor to
>> close the contained cursor  when it is closed. If required, I can
>> provide these changes as separate fixes
> Yeah, just create a new JIRA for this one, and attach new patches.
>> 3) Technical details at JDBM level:
>> *The core functionality is at This implements a
>> concurrent, versioned cache. There a power of two hash buckets and a
>> lock for each of the 8 buckets(lock striping). Number of hash buckets
>> x where x is closest power of two such that x<  max number of cache
>> entries.
> Ok.
>> *Cache replacement policy is LRU. There are 16 lru lists and each
>> cache entry is assigned to one of the lru lists. Each lru is protected
>> by a separate lock. LRU replacement is supposed to be fast. Threads
>> choose an lru based on an lru randomizer. Since replacement is
>> supposed to be fast and each thread randomly chooses an lru to replace
>> from, lru operations should not be a bottleneck.
> Do we need to make this number (16) configurable ?
>> * Each cache entry has a [startVersion, endVersion) where it is valid.
>> At any time, a hash bucket chain looks like this:
>>  (key1, startVersion11, endVersion11)<->  (key2, startversion21,
>> endVersion21)<->
>>            |
>>    |
>> (key1, startVersion12, endVersion12)         (key2, startversion22,
>> endVersion22)
>>            |
>>    |
>>         ....
>> .......
>> that is, there is a primary chain where entries for different keys are
>> chained and then there is subchain where different versions for a
>> given key are held. So, when readers search for a (key, version), they
>> first walk the primary chain and then walk the subchain to find their
>> entry.
> The schema is not rendered correctly in the mail, but it's not a big deal.
> I'm not sure I grok your explanation though. I need to re-read it later...
>> *As writes create previous versions of entries,
> You mean 'new versions', right ? Or maybe what you mean is that once you
> modify an existing entry, the previous version of this entry is stored into
> the cache ?
>  they use part of the
>> cache to store them. The rule is that such an entry cannot be replaced
>> as long as there might be a reader which might read it. We keep track
>> of the minimum read action version to make such entries replaceable .
>> *As writes create previous versions of entries, they use part of the
>> cache to store them. If there are long browse operations and quite a
>> bit of updates going on at the same time, we might run into a case
>> where most of the cache entries are used to store previous versions.
> Ok.
>  We might even have a case  where all entries store previous versions
>> and they cannot be replaced(because of the rule above). In this case,
>> writers wait for a freeable cache entry.
> Ok. We could probably chose to keep the old versions on disk, to avoid
> having to hang a write, but for a first version, I think it's an acceptable
> solution.
>   When a reader cannot find a
>> replaceable entry, it  does read from disk while holding the bucket
>> latch(and thus holding any possible writer on the same location). and
>> return the entry to the user without populating the cache and thus
>> without looking for a replaceable cache entry.  Since readers always
>> make progress, min read version will eventually advance and writers
>> will progress too. Normally, when readers or writers do IO, they
>> release the hash latch.
> Ok.
>> * There some helper classes for the LRUCache to work. Maybe the most
>> interesting ones are ActionVersioning which uses AtomicInteger and
>> AtomicReference and is optimized for the read mostly case. Also we
>> have ExplicitList where remove operations are O(1) given an
>> element(this is in contrast to O(n) remove given a pointer to an
>> element on Java's lists). Such fast removes are needed for lru
>> algorithm.
> Ok
>> *When (key,value) pairs are added to the Btree or when they are
>> retrieved, Btree does a deep copy of the value(through serialization,
>> deserialization). This is needed so that Btree can store previous
>> versions of values. I assumed key stored in Btrees are not changed. If
>> the do, even the CacheRecordManager currently in use wouldnt work.
> Do you mean we never delete anything from the BTree ?
>> 4) Possible improvements:
>> *if most of the cache entries are used to store previous versions,
>> cache effectiveness will decrease.A solultion is to start spilling
>> previous versions to disk when such a thing happens. The subchain we
>> talked about above would have to be spilled to disk. However, this is
>> only a performance problem and is a corner case one as well if it is
>> true that ldap is read mostly.
> Yes. See above one of my comment.
>> * Currently when a write action is executing, if there is an IO
>> exception action is aborted and  I do not advance the read version and
>> thus readers do not see the affects of the aborted action. However, it
>> seems that upper layers do not do enough cleanup in this case, they
>> continue using the jdbm stores and this will lead to inconsistency. A
>> good thing would be to rollback all the dirty changes . Also, jdbm
>> txns are not enable currently so a crash in the middle of syncing
>> might leave the store inconsistent.
> I guess we have to address those issues. Crashes can occur when we get a
> file-system full, or a defective disk. In both cases, I think having a crash
> recovery system should be enough, as a first approach. In any case, I guess
> that the server has to be started when it occurs, so that some corrective
> actions can be done before the backend is servered any more...
>> 5) TODO:
>> *add some more test cases for the verisioned btree to test corner cases.
>> *I am not very willing to implement disk spilling since this is only a
>> performance improvement needed in corner cases if stores are mostly
>> read only. But if you guys think this is really necessary, I might
>> look into this as well.
> Right now, I consider that it's just an improvement. What is important atm
> is to have a working server. That means we need more tests, and some wide
> ones (ie, with many clients injecting read and write requests, run for some
> long period of time).
> Selcuk, this is an excellent work ! I'm going to apply your last changes to
> the repo. Many thanks !
> --
> Regards,
> Cordialement,
> Emmanuel L├ęcharny

Best Regards,
-- Alex

View raw message