directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Karasulu" <akaras...@apache.org>
Subject Re: [ApacheDS] OR expression optimizations for Cursors and Evaluators
Date Fri, 28 Mar 2008 14:53:43 GMT
On Fri, Mar 28, 2008 at 10:25 AM, Alex Karasulu <akarasulu@apache.org>
wrote:

> On Fri, Mar 28, 2008 at 7:06 AM, Emmanuel Lecharny <elecharny@gmail.com>
> wrote:
>
> > Considering we are searching in the US, the scope could be
> > c=US,dc=acme,dc=com, we will limit the search in this branch. This is
> > done by adding internally a new assertion in the filter :
> >
> > (&(ou=java)(ou=engineering)(scope.subtree="c=US,dc=acme,dc=com"))
> >
> > <note>
> > This is not exactly how it is done in the server, it's just a 10K feet
> > presentation of the principles
> > </note>
> >
>
> You're on target yeah we add this node to the AST.  BTW with alias
> dereferencing while searching the number of search bases expand in the scope
> node as we encounter aliases which "expand" the search space.   Still we add
> one scope node but that scope node performs multiple assertions.
>
> Now, the search engine do a computation on this filter to determinate
> > which index should be used. Suppose that the ou=java links to 10000
> > persons, and that the ou=engineering links to 1000 persons, we will use
> > the ou=engineering attribute to limit the number of candidate to bring
> > back from the backend, and check with the two other assertions to
> > eliminate the bad candidates. We do that by enumerating the number of
> > elements associated with eac assertion :
> >
> > (&(10000:ou=java)(1000:ou=engineering)(MAX_INT:scope.subtree=
> > "c=US,dc=acme,dc=com"))
> >
>
> Yep this "annotation" of the modified filter AST is done by the optimizer.
>   And BTW we'll be adding a new index which will give us an exact count for
> the scope node so we don't have to use MAX_INT anymore.
>
> Currently, the scope assertion is just used to eliminate the candidates
> > which are found by the search engine, we don't 'count' the number of
> > candidates under a position on the tree (so the MAX_INT value). In our
> > sample, if we suppose that the scope is containing 100 entries, we
> > immediatly see that we are not using the correct assertion to bring back
> > the entries from the backend.
>
>
> Right this will change.  I will split the current hierarchy index (a
> misnomer btw since it's really the parent-child index) into a onelevel and
> sublevel index.
>
>  Moreover if we apply the scope to the 'ou' index, we may see that we
> > have only 10 java coders under the c=US,dc=acme,dc=com branch, instead
> > of 10000 (not likely, but who knows  ? :)
> >
> > So the idea we had was to use the hierarchy index to add some
> > information about the other index. Here, we may have the number of java
> > coder in the c=US,dc=acme,dc=com branch stored under the hierarchy index
> > for the key 'c=US,dc=acme,dc=com'. We store all the counst for each
> > index into this Hierarcy index.
> > As it will be modified when we modify the other index, this might cost a
> > bit on modifications, deletions and additions, but this will have a huge
> > impact on the search requests, which is what we target.
> >
> > In our example, if we suppose that US has only 10 java coders, the
> > annotated filter will be :
> >
> > (&(10:ou=java)(200:ou=engineering)(500:scope.subtree=
> > "c=US,dc=acme,dc=com"))
> >
> > assuming that we also have 200 engineering in the branch, and that the
> > scope will bring 500 entries.
> >
> > Now, we will just bring back 10 candidates from the backend instead of
> > 1000.
> >
> > This is really a gross example, but I think it gives the idea.
> >
> > Is it totally insane, or does it makes some sense ?
> >
>
> Let me rephrase in my own words to see if I understand the idea.  You're
> suggesting the use of a kind of a scope based LUT or separate index used to
> return candidates matched by an assertion with a specific attribute and
> value limited by scope.
>
> Effectively we're going to have this (I think - please correct me) with
> the two separate system indices for hierarchical relationships instead of
> just having one parent-child index as we do now.  If you look at the
> documentation here you'll see a description of the onelevel and sublevel
> indices:
>
>
> http://cwiki.apache.org/confluence/display/DIRxSRVx11/Structure+and+Organization
>
> ======== From Doco =======
>
> "The one level index maps parent entry identifiers from the master to the
> identifiers of their children. This index is used to facilitate various name
> space operations like renaming, and moving branches which impacts children.
> It is also used by the search engine to conduct one level search requests."
>
> This index maps ancestor entry identifiers from the master to the
> identifiers of all their descendants including immediate children. It does
> not map the descendants of the context entry (at the root) of the partition.
> This is just unnecessary since all other entries in the partition satisfy
> this condition. If this is something we desire to enumerate we can get a
> reverse Cursor on the ndn index and advance past the first entry, to start
> enumerating all the descendant identifiers of the context entry.
>
> This index plays a critical role in subtree search. It allows the search
> optimizer to detect a count and annotate the modified search filter for
> subtree scope nodes. It also allows a Cursor to enumerate the set of
> descendants associated with an entry. Without this index, the search engine
> must resort to expensive DN based operations for subtree scope constraint
> checking."
> ======== From Doco =======
>
> Now with these two indices as opposed to the one we had we can better
> constrain subtree search.  The subtree search will limit the candidates as
> will the assertions in the filter as it was given by the user.
>
> But I think you mean more and I think I am catching on to it.  You want to
> constrain individual assertions off an index with scope at each level rather
> than at just the top perhaps.  I have not thought about this at all so this
> is a good time to explore it.   Here's what the search engine does to
> introduce scope into the AST:
>
>
> http://cwiki.apache.org/confluence/display/DIRxSRVx11/SearchEngine+and+Optimizer
>
> So what would taking the conjunction ('AND' or intersection) of the scope
> node with each leaf assertion do to influence the search plan and cost of
> searching instead of taking the conjunction only at the top?
>
> I guess it depends on the kind of logic in the filter.  If we have a
> disjunction (OR) then applying the scope constraint to all child
> sub-expressions could seriously reduce the IO/processing of search
> operations while doing the same work.  Take this example:
>
> (| (ou=engineering)[100] (c=US)[1000] (l=Jacksonville)[10])
>
> => 1110 worst case for entire filter
>
> If under scope constraints these numbers are reduced by a factor of 10:
>
> (| (ou=engineering)[10] (c=US)[100] (l=Jacksonville)[1])
>
> => 111 worst case for entire filter
>

More thoughts ...

So if 1000 entries are in scope.  Then the filter above with the first set
of scan counts would result in a Cursor over the subtree index (presuming
subtree search scope parameter value).  This is because 1000 < 1110.  This
is what happens when the scope node is AND'ed up at the top instead of being
pushed deeper into the AST in multiple places.

If we push the scope node down to constrain each assertion in 3 places as
well as at the top then the Cursor will still be built on the scope node at
the top most conjunction.  This is because and'ing the scope node in lower
layers still results in the same scan counts.  For example

(& (ou=engineering)[100] (scope)[1000] )

still has a total scan count of 100.  So the filter AST with scope
constraints applied all over will still build the candidate generating
cursor on the topmost conjunction's scope.  It will require 1000 iterations
with 1000*6 lookups now.  Without pushing the scope into leaves we get 1000
iterations with 1000*3 lookups.

So this is not so much a great way to implement your idea.

Alex

Mime
View raw message