directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Karasulu <akaras...@apache.org>
Subject Re: Update about subtree problems
Date Tue, 20 Jul 2010 08:47:36 GMT
Hi Emmanuel,

On Tue, Jul 20, 2010 at 2:34 AM, Emmanuel Lecharny <elecharny@gmail.com>wrote:

>  Hi guys,
>
> just an update about the current situation.
>
> We discussed on this ML about options regarding how to handle subtrees in
> the most efficient way in the server. We currently have 3 possibilities :
>
> - the first one, which is partially implemented in trunk, consist on static
> references to subentries in all the entries selected by the
> subtreeSpecifications defined in each subentry.
>
>
I'm supposing you're referring to static data references in partition
Entries which reference the subentries affecting the Entry using the proper
administrative operational attributes, rather than using static Java
references in the Entry class correct?


> - the second one suppose that we compute when needed if an entry is a
> selected entry, each time we access this entry
>
> - the third solution, which has not been discussed already, would be to
> have an index which stores the list of all the selected entries associated
> with each subentry.
>
> Let's first think about the third solution (this is the one implemented by
> OpenDS, btw). It's a costly one when you create a new subentry, or when you
> move one, as you have to update the index with all the impacted entries.
> Plus it also has a cost as you have to hit the index to get the list of
> subentries an entry is part of. Of course, we can use a cache, but I'm not
> sure that it will save a lot of time. In fact, AFAICT, you pay the price
> twice : when creating the subentry, and each time you access any entry.
>
> Ok, so option 3 does not seems to be reasonable. To be frank, we discussed
> about t with Pierre-Arnaud but not really considered this as a decent
> solution. We may be wrong though.
>
>
There might be one other option. How about adding the references to the
affecting subentries to the RDN index which replaced our old DN index?
Essentially we add an additional field to store these attributes in the RDN
value which we already have turned into a structure to hold the parent ID of
entries.

I think this is more valuable than a separate index because first we don't
need fully bidirectional lookups. We will not be looking up entries by the
subentry DN to get a list. We will be checking to see if each entry before
returning it or modifying it is impacted by subentries. Second this RDN
index will most likely be in memory. It's so central we should have it all
cached anyway. Sort of like what AD does with it's global catalog.

So carrying this additional information in the RDN index hence the DN, will
allow us to rapidly determine which subentries impact an Entry.

Now in terms of handling administrative changes we still have the same
issue: a potentially massive numbers of updates will be unavoidable on the
RDN index. However maybe the heavy caching here might help speed things up.
Don't if this is the case but Emmanuel your tests on JDBM performance with
writes and caches etc might help here.

Maybe this is a horrible solution but I don't know just a brain dump.


> Regarding option 1 and 2, none is perfect.
>
> Let's see the pros and cons for each of them
>
> Solution 1 (update when manipulating the subentry)
> --------------------------------------------------
> pros :
> - *very* efficient operations on normal entries, as they already have a
> reference to all the subentries that select them.
> - no extra cost when doing a rename, a modify, an add or a delete of a
> normal entry.
> - all the subentries are stored in a cache loaded at startup, no need to
> fetch an index to get a subentry
> - evaluating if an entry is selected is done when the subentry is added or
> moved
> - move operations applied on normal entries are rare
> - any operation applied on a subentry are considered as admin operation,
> thus unfrequent and scheduled. Downtime is under control.
>
> cons :
> - *very* costly add/delete/move/rename operations applied on a subentry, as
> every selected entries will have to be modified
> - *very* costly move operation of a normal entry if it moves from one AP to
> another one, and when it has many children, as we have to update all the
> children twice : first to reove the old subentries' reference, the second
> one to add the new references
> - Adding or removing a subentry is a non atomic operation, as we have first
> to update the selected entries one by one, and then to add/delete the
> subentry
>
>
Right we have to have some kind of atomicity control here. I wish we had
MVCC to assist us here.


> Solution 2 (evaluating when needed )
> ------------------------------------
> pros :
> - adding/removing/moving/renaming a subentry is a O(1) operation
> - moving normal entries selected by some subentries does not cost more than
> a standard move
> - easy to implement
>
> cons :
> - each time we access an entry (search, compare), we will have to evaluate
> the subtreeSpecifciation for all the subentries this entry is dependent on
> - we will need another cache to lookup for the entry's related subentries
> (this cache will work pretty much like the one we use for the
> NamingContexts)
>
>
> Basically, sol.1 and sol.2 are both painful, but there is no perfect
> solution anyway.
>
> I thought about what is the best implementation all day long, and if
> yesterday I was convinced that the solution 2 was better. But the evaluation
> cost done on each entry every time a client does a search is quite high
> (let's say up to 20% of the search cost, for simple SS). That will not slow
> down the server that much, as most of the time is consummed in the network
> layer right now, but this won't last.
>
> Now I think that the current solution is may be a not so bad compromise,
> assuming we don't do a lot of move operation cross APs on normal entries.
>
> What bugs me here is that last year, for the opposite reason, the DN was
> removed from the entries stored in the master table, just t be able to do
> O(1) move operations. This is quite paradoxal ! I still think that move
> operations are very rare, and that storing the DN into the entries is a net
> gain for most of the operations, except for a move...
>
>
> Some side note :
> after having done some perf tests on the evaluator, and applied some
> improvement, I can tell that depending on the number of subentries an entry
> is depending on, the cost of this evaluation can goes up to 50% of the
> search itself cost - not counting the network layer -. For instance,
> evaluating a subtreeSpecification with a min and a max, no chop, will be
> done up to 1 000 000 times per second on a 3 level DN (this is all dependent
> on the DN size)
>
> Last, not least : the current implementation is really incomplete. The
> Move, Rename and MoveAndRename operations are not correctly handled, with
> many entries not being updated. I'm going to fix them. I have also created a
> branch to play with subtree without breaking the trunk. I'm not sure I will
> continue to work on this branch if a decision is  made to keep the first
> solution.
>


-- 
Alex Karasulu
My Blog :: http://www.jroller.com/akarasulu/
Apache Directory Server :: http://directory.apache.org
Apache MINA :: http://mina.apache.org
To set up a meeting with me: http://tungle.me/AlexKarasulu

Mime
View raw message