directory-dev mailing list archives

From Emmanuel Lecharny <elecha...@gmail.com>
Subject Re: AdministrativePoint handling
Date Mon, 13 Dec 2010 17:33:18 GMT
On 12/13/10 6:00 PM, Alex Karasulu wrote:
> On Mon, Dec 13, 2010 at 12:41 PM, Emmanuel Lecharny<elecharny@gmail.com>wrote:
>
>> Hi guys,
>>
>> I'm resuming the APs implementation. I have had a new idea this morning
>> about the best way to handle them.
>>
>> So far, we have discussed two possible options:
>>
>> 1) Processing the entries while adding a new AP
>> The idea is that when we add a new AP (or delete or modify an existing
>> one), we evaluate all the underlying entries and update each one selected
>> by the associated SubtreeSpecification, adding a reference to its
>> associated AP in the entry.
>> This is done once; afterwards, when a user fetches an entry, we don't
>> need to evaluate it again, as the work has already been done. We accept
>> to pay the price once, and it's free for every operation after this
>> initial update.
>>
>> The price to pay is pretty steep if we consider a huge base with
>> hundreds of thousands of entries, as we have to update many of them, and
>> an update is slow (we can currently process around 600 such updates per
>> second - depending on the server we are running on, of course - so for
>> 1 000 000 entries it might take up to 30 minutes).
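The 30-minute figure works out from the quoted rate; a quick back-of-the-envelope check (the 600 updates/s rate is the rough, hardware-dependent measurement mentioned above):

```python
# Rough cost of option #1: every entry under the AP is updated once.
UPDATES_PER_SECOND = 600      # rate quoted in the thread (hardware-dependent)
ENTRIES = 1_000_000

minutes = ENTRIES / UPDATES_PER_SECOND / 60
print(f"~{minutes:.0f} minutes")   # ~28 minutes, hence "up to 30 minutes"
```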
>>
>> 2) Processing the evaluation on the fly
>> The second idea is to evaluate the entry when a user requests it, to avoid
>> paying the price for the initial update.
>>
>> The big problem with this approach is that if we have many AP
>> SubtreeSpecifications to evaluate, the evaluation is not free, and it has
>> to occur *every time* a user accesses the entry. That could slow down the
>> server a lot (probably doubling the time it takes to return an entry).
>>
>> So far, we have decided to go for option #1, simply because adding an AP
>> is an administrative task, which does not occur very often. We all
>> realize that it has drawbacks, but it's far better than option #2 anyway.
>>
>> Now, I'd like to propose a third option, which might be acceptable,
>> considering that we will pay a small price compared to both option #1 and
>> #2.
>>
>> 3) Processing on the fly and store the result
>> The idea is to process the evaluation when someone fetches an entry, but to
>> store the evaluation result in the entry. If the fetched entry has already
>> been evaluated, then no need to process it again.
>>
>> The direct benefit is that we avoid the huge initial update required by
>> option #1, and evaluating an entry will just be a matter of checking
>> whether an AT has been set, instead of fully evaluating the entry as in
>> option #2.
>>
>>
> This sounds like an excellent compromise.
>
> Is there a chance, though, of a search triggering a large set of these
> updates, making this approach perform in practice close to approach #1?
> Let me explain:
>
> Usually directories containing millions of entries have most of their
> entries under a small number of containers like ou=users for example.
> Directories today are flatter than in the past where there was more depth
> and hierarchy.
>
> Using a search filter without many constraints (AVAs) under a congested
> base with one-level or subtree scope may trigger the need to evaluate a
> large set of entries.
>
> To think this through let me just for the record go through the way search
> interacts with the administrative model subsystem. I won't consider
> base/object scoped searches.
>
>     (1) On the way into the server a search may have some evaluations
>           done on it.
>
>     (2) The search call returns a Cursor for the search request
>           parameters, positioned at the first candidate to return. This
>           may have actually processed several candidates that do not
>           comply with the filter. During candidate processing,
>           administrative aspects will be considered, so subtree
>           specification evaluation will occur.
>
>     (3) Thereafter, calls to get the next candidate repeat the steps in
>           #2 until the Cursor is exhausted, or search limits, if in
>           effect, halt the search.
>
> In step #2, evaluations checking entries for inclusion in administrative
> areas defined by subtree specifications occur only after the filter's
> constraints have been applied and the entry in question passes those
> assertions.
>
> So a search can only trigger updates on entries satisfying its filter.
> Also, for standard (non-admin) users, this will be cut short by search
> limits. If the filter is not very restrictive, for example
> (objectClass=*), then this is bad, but search limits should constrain the
> number of updates performed.
>
> So only a subset of the updates will take place. It does sound good going
> through the entire thought process.

If I understand the objection, which makes sense: if the 
SubtreeSpecification selects an area which is much smaller than the 
subtree content, then yes, the proposed solution might be slower.

For instance, suppose we have 1 million user entries in a flat tree, 
with some SubtreeSpecification selecting 1000 users out of the 1M. Then 
using a precomputed index will speed up the search within this 
SubtreeSpecification, as we can use this index to reduce the number of 
candidates.

However, we can provide a mechanism that creates an index if needed. Let 
me explain: if speed is really an issue in such a case, what about 
having an extended operation that asks the server to create such an 
index for an AP, and stores the information in this AP, so that the 
search engine can leverage it if it exists? Of course, this requires 
the full evaluation to be done.
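To make the idea concrete, here is a minimal sketch (in Python, with hypothetical names; this extended operation does not exist in ApacheDS) of building such a per-AP index once, and letting the search engine use it to prune candidates:

```python
# Hypothetical sketch of the proposed "build me an index" extended operation.
# An AP and the entries are modeled as plain dicts; `selects` stands in for
# real SubtreeSpecification evaluation.

def build_ap_index(ap, all_entries, selects):
    """Full evaluation, done once: record which entryUUIDs the AP selects."""
    index = {e["uuid"] for e in all_entries if selects(ap, e)}
    ap["index"] = index        # stored with the AP for the search engine
    return index

def candidates(ap, all_entries):
    """Use the precomputed index when present, else consider every entry."""
    if "index" in ap:
        return [e for e in all_entries if e["uuid"] in ap["index"]]
    return list(all_entries)
```

With a 1000-out-of-1M selection as in the example above, the candidate list shrinks by three orders of magnitude once the index exists.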

>
> What about when the AP is modified and deleted ?
AP modification and deletion is not an issue. Each AP has a unique TS, 
and when it's deleted, or moved, the entries depend on another AP, with 
a different TS (at this point, it's probably better to name it a 
revision number, but it does not matter too much; what is important is 
that this number *must* be unique).

Let's see how it works for a deletion:
- if the deleted AP depends on a higher AP in the tree, then the entries 
associated with the deleted AP will now depend on the higher AP. The 
reference to the deleted AP is now irrelevant, and so is the TS. When 
someone fetches such an entry, the server will search for the closest 
AP and compare the AP's TS with the entry's TS: as they will differ, we 
remove the ref from the entry, update it, then update the TS. We are done!

For a modification, it's the same thing:
- move: we still compare the entries' TS with the AP they depend on 
(the moved AP or the higher AP, depending on the situation).
- rename: nothing changes.
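All the cases above boil down to one comparison at fetch time. A minimal sketch, with dict-based entries and a `matches` stand-in for real SubtreeSpecification evaluation (all names are hypothetical, not ApacheDS code):

```python
# Option #3 in miniature: lazily re-evaluate an entry against its closest AP
# whenever the stored sequence numbers differ (or the entry was never
# evaluated), then cache the result in the entry itself.

def matches(subtree_spec, entry):
    # Stand-in for real SubtreeSpecification evaluation: a simple base-DN
    # suffix check.
    return entry["dn"].endswith(subtree_spec["base"])

def fetch(entry, closest_ap):
    """Return the entry, refreshing its AP reference if it is stale."""
    if entry.get("ap_seq_number") != closest_ap["seq_number"]:
        # Stale or never evaluated: run the SubtreeSpecification once.
        if matches(closest_ap["subtree_spec"], entry):
            entry["ap_ref"] = closest_ap["uuid"]      # entry is selected
        else:
            entry.pop("ap_ref", None)                 # entry is not selected
        entry["ap_seq_number"] = closest_ap["seq_number"]
    return entry
```

A deleted or moved AP never has to touch the entries: the next fetch sees a sequence-number mismatch against the (new) closest AP and fixes the entry up on the spot.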

>
>> a Timestamp (TS) is updated.
>
> Is this a global timestamp used for this purpose system wide?
It's more a revision number, or a unique identifier. A SeqNumber, in fact.
>
>> This TS will be incremented every time we add a new AP.
>
> Again also when we update or delete an AP (implies subentries as well)?
Same, if we update the AP content, the TS must be updated.
>
>> We don't update any entry.
>> Then when a user fetch some entries, for every one of the selected entries,
>> we check if it has a TS AT present.
>
> I presume this timestamp attribute is separate from the modifyTimestamp or
> createTimestamp right?
Right, otherwise we might have conflicts, as the server is so damn fast :)
> So there's a special timestamp in the entry and on somewhere in the server
> to do these TS comparisons?
Yes, in each entry and in the AP. It will be an OperationalAttribute.
>
>> If not, then we evaluate the entry against the AP it depends upon, and
>> when done, we add the last TS into the entry, save it, and return (or
>> not) the entry to the user.
>> If the entry has a TS AT, then we have two cases:
>> - The TS equals the last TS: the entry has already been evaluated, and we
>> can return it (or not, depending on the evaluation result)
>> - The TS is older, so we reevaluate the entry.
>>
>> Now, how will we distinguish between an entry evaluating to true or false ?
>> Easy : if the entry is selected by a SubtreeSpecification, we will have a
>> ref to the associated AP in it, otherwise no.
>>
>>
> So you have to update the timestamp and the ref back to the AP containing
> the subentry selecting it right?
Right. We use the EntryUUID to avoid all the updates when moving an AP 
(we already decided to do that, even if it costs an extra lookup in the 
DN/UUID table).
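The EntryUUID indirection can be sketched as follows (a toy model with made-up names, not ApacheDS code): each entry keeps the AP's entryUUID, so moving or renaming the AP only touches the DN/UUID table, never the referencing entries:

```python
# Why the entry stores the AP's entryUUID rather than its DN: the extra
# lookup buys immunity to AP moves and renames.

class UuidTable:
    """Toy stand-in for the server's DN/UUID table."""
    def __init__(self):
        self._dn_by_uuid = {}

    def add(self, uuid, dn):
        self._dn_by_uuid[uuid] = dn

    def move(self, uuid, new_dn):
        self._dn_by_uuid[uuid] = new_dn   # only the table changes

    def lookup(self, uuid):
        return self._dn_by_uuid[uuid]

table = UuidTable()
table.add("ap-uuid-1", "ou=ap,dc=example,dc=com")
entry = {"ap_ref": "ap-uuid-1"}           # stable reference, survives moves

table.move("ap-uuid-1", "ou=ap,ou=moved,dc=example,dc=com")
print(table.lookup(entry["ap_ref"]))      # resolves to the new DN; the entry
                                          # itself was never rewritten
```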
>
>> I do think that this third option has great advantages, and not only from
>> the user or admin POV : the implementation will be greatly improved, as we
>> don't have to deal with all the cases when handling an AP
>> addition/modification/deletion.
>>
>>
> It's still somewhat hazy.
Yeah, a bit! I must admit it came to my mind this morning, I have no 
idea why. A bit like when you found the best way to deal with DNs...

I'm not sure I've covered all the bases, but it sounds promising and 
deserves a discussion...
>
>> One last thing : one may object that the on the fly evaluation is not
>> acceptable, as it will cost a lot for the first access to an entry. True.
>
> I'd think it's very acceptable but does the gain outweigh the complexity is
> my question and I still don't fully understand your approach.
The strange thing is that it seems to me less complex to implement than 
dealing with all the potential cases if we process all the entries when 
adding an AP. For instance, we no longer have to take care of atomicity 
for AP operations, as we just update the AP, nothing else. I was stuck 
yesterday trying to see how we can revert an AP addition if we have 
started to modify thousands of entries and suddenly get a failure...

There is also extra complexity in the current algorithm, like how to 
avoid updating the entries more than once, as we have 4 roles to deal 
with, and this is really very complex. The proposed algorithm will 
allow us to just take care of one role for one entry at a time.

> This is a really complex part of the server and I'm hoping we can find ways
> to simplify it further.
>
> At some point you had a great idea of using indirection via UUID instead of
> a direct reference with DN back to the AP.  Are you still thinking about
> this approach. I need to dig up that email you originally had sent to the
> dev list. It was a great idea.
Yeah, absolutely. But it's really orthogonal at this point. Whatever 
solution we use, we will manipulate entryUUIDs.
>
>> But I claim that it's up to the administrator to deal with this problem,
>> and then it's no longer a user issue. When the admin adds a new AP, he
>> can also fetch all the entries below the added AP, and they will be
>> updated on the fly. The users won't be impacted anymore.
>>
>> thoughts ?
>>
>>
> The idea is great but I'm worried about how much more complexity it may
> introduce on top of something that already needs simplification but I guess
> experimentation will really tell you that.
Probably. Keep in mind that introducing IAPs already injects another 
order of magnitude of complexity, but I really don't see how we can 
avoid having IAPs.
> The best thing to do is get a clear idea of your approach and try
> implementing it in a branch after some discussion to help address all the
> issues.
I was also thinking that we might want to have heuristics. Sometimes it 
might be better to have all the entries evaluated once and for all, 
sometimes we don't care.

Btw, the trunk does not have anything working atm, so we need something 
working quite quickly.

I'll create a branch to experiment; that will allow other people to 
continue working on the server.

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

