directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel L├ęcharny <elecha...@apache.org>
Subject Re: Update about subtree problems
Date Tue, 20 Jul 2010 09:15:00 GMT
  On 7/20/10 10:47 AM, Alex Karasulu wrote:
> Hi Emmanuel,
>
> On Tue, Jul 20, 2010 at 2:34 AM, Emmanuel Lecharny<elecharny@gmail.com>wrote:
>
>>   Hi guys,
>>
>> just an update about the current situation.
>>
>> We discussed on this ML about options regarding how to handle subtrees in
>> the most efficient way in the server. We currently have 3 possibilities :
>>
>> - the first one, which is partially implemented in trunk, consist on static
>> references to subentries in all the entries selected by the
>> subtreeSpecifications defined in each subentry.
>>
>>
> I'm supposing you're referring to static data references in partition
> Entries which reference the subentries affecting the Entry using the proper
> administrative operational attributes, rather than using static Java
> references in the Entry class correct?

I'm reffering to the subentry DN which are stored in the entries, as 
operational attributes, right.
>
>> - the second one suppose that we compute when needed if an entry is a
>> selected entry, each time we access this entry
>>
>> - the third solution, which has not been discussed already, would be to
>> have an index which stores the list of all the selected entries associated
>> with each subentry.
>>
>> Let's first think about the third solution (this is the one implemented by
>> OpenDS, btw). It's a costly one when you create a new subentry, or when you
>> move one, as you have to update the index with all the impacted entries.
>> Plus it also has a cost as you have to hit the index to get the list of
>> subentries an entry is part of. Of course, we can use a cache, but I'm not
>> sure that it will save a lot of time. In fact, AFAICT, you pay the price
>> twice : when creating the subentry, and each time you access any entry.
>>
>> Ok, so option 3 does not seems to be reasonable. To be frank, we discussed
>> about t with Pierre-Arnaud but not really considered this as a decent
>> solution. We may be wrong though.
>>
>>
> There might be one other option. How about adding the references to the
> affecting subentries to the RDN index which replaced our old DN index?
> Essentially we add an additional field to store these attributes in the RDN
> value which we already have turned into a structure to hold the parent ID of
> entries.
uh oh, sounds smart :) But sadly, the inference is the same : you will 
have to inject the references in *all* the RDN of *ll* the selected 
entries. It does not change the overall cost, which is still O(n).
> I think this is more valuable than a separate index because first we don't
> need fully bidirectional lookups. We will not be looking up entries by the
> subentry DN to get a list. We will be checking to see if each entry before
> returning it or modifying it is impacted by subentries. Second this RDN
> index will most likely be in memory. It's so central we should have it all
> cached anyway. Sort of like what AD does with it's global catalog.
Yes, we need a cache for the RDN. This will be one of the major 
improvement we can make to the server once we get 2.0 out.
> So carrying this additional information in the RDN index hence the DN, will
> allow us to rapidly determine which subentries impact an Entry.
see above : it will take longer to determinate if an entry is impacted, 
as you will have to check the RDN, when you already have the information 
in the op attr. Ok, only slightly longer.
> Now in terms of handling administrative changes we still have the same
> issue: a potentially massive numbers of updates will be unavoidable on the
> RDN index. However maybe the heavy caching here might help speed things up.
No, because you still have to flush the data on disk. The problem is the 
same : O(n) operation, whatever we do.
> Don't if this is the case but Emmanuel your tests on JDBM performance with
> writes and caches etc might help here.
>
> Maybe this is a horrible solution but I don't know just a brain dump.

I don't know how to transform a O(n) problem on O(1) either ;)
>
>> Regarding option 1 and 2, none is perfect.
>>
>> Let's see the pros and cons for each of them
>>
>> Solution 1 (update when manipulating the subentry)
>> --------------------------------------------------
>> pros :
>> - *very* efficient operations on normal entries, as they already have a
>> reference to all the subentries that select them.
>> - no extra cost when doing a rename, a modify, an add or a delete of a
>> normal entry.
>> - all the subentries are stored in a cache loaded at startup, no need to
>> fetch an index to get a subentry
>> - evaluating if an entry is selected is done when the subentry is added or
>> moved
>> - move operations applied on normal entries are rare
>> - any operation applied on a subentry are considered as admin operation,
>> thus unfrequent and scheduled. Downtime is under control.
>>
>> cons :
>> - *very* costly add/delete/move/rename operations applied on a subentry, as
>> every selected entries will have to be modified
>> - *very* costly move operation of a normal entry if it moves from one AP to
>> another one, and when it has many children, as we have to update all the
>> children twice : first to reove the old subentries' reference, the second
>> one to add the new references
>> - Adding or removing a subentry is a non atomic operation, as we have first
>> to update the selected entries one by one, and then to add/delete the
>> subentry
>>
>>
> Right we have to have some kind of atomicity control here. I wish we had
> MVCC to assist us here.
>
I'm not sure atomicity is an issue at all. Of course, if the server 
crashes in the middle of the operation, we will have a broken base with 
some entries updated and some not, leaving the subentry selection 
inconsistent. However, with the Journal, we should be able to restore 
the base by reapplying the operation.

If we consider the addition of a subentry ( or a removal), then as it 
can take ages, a search done in the mean time might well give back an 
entry with a reference to a subentry which will be removed soon, or on 
the other side, no reference when it will be added later. Is this a 
problem ? It all depends on when we update the subentry :
- subentry addition : the subentry must be added prior to any update 
done on selected entries, otherwise the reference to the added subentry 
will be null
- subentry deletion : we must first remove the reference to the subentry 
in all the selected entries first, then remove the subentry at the end, 
otherwise, same punishment when we try to point to the subentry
- move : consider a move as a delete/add operation
- rename : same : delete/add operation
- modify : same as add

I hope I have covered all the possible situations here. of course, with 
solution 2, all this is not necessary, because the subentries are 
evaluated on the fly. Another pro for sol. 2.

-- 
Regards,
Cordialement,
Emmanuel L├ęcharny
www.iktek.com


Mime
View raw message