directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Karasulu" <akaras...@apache.org>
Subject Re: [ApacheDS] OR expression optimizations for Cursors and Evaluators
Date Fri, 28 Mar 2008 16:39:07 GMT
Another thought...

I would love to setup some performance tests to validate these ideas as
well.

Alex

On Fri, Mar 28, 2008 at 10:53 AM, Alex Karasulu <akarasulu@apache.org>
wrote:

>
>
> On Fri, Mar 28, 2008 at 10:25 AM, Alex Karasulu <akarasulu@apache.org>
> wrote:
>
> > On Fri, Mar 28, 2008 at 7:06 AM, Emmanuel Lecharny <elecharny@gmail.com>
> > wrote:
> >
> > > Considering we are searching in the US, the scope could be
> > > c=US,dc=acme,dc=com, we will limit the search in this branch. This is
> > > done by adding internally a new assertion in the filter :
> > >
> > > (&(ou=java)(ou=engineering)(scope.subtree="c=US,dc=acme,dc=com"))
> > >
> > > <note>
> > > This is not exactly how it is done in the server, it's just a 10K feet
> > > presentation of the principles
> > > </note>
> > >
> >
> > You're on target yeah we add this node to the AST.  BTW with alias
> > dereferencing while searching the number of search bases expand in the scope
> > node as we encounter aliases which "expand" the search space.   Still we add
> > one scope node but that scope node performs multiple assertions.
> >
> > Now, the search engine do a computation on this filter to determinate
> > > which index should be used. Suppose that the ou=java links to 10000
> > > persons, and that the ou=engineering links to 1000 persons, we will
> > > use
> > > the ou=engineering attribute to limit the number of candidate to bring
> > > back from the backend, and check with the two other assertions to
> > > eliminate the bad candidates. We do that by enumerating the number of
> > > elements associated with eac assertion :
> > >
> > > (&(10000:ou=java)(1000:ou=engineering)(MAX_INT:scope.subtree=
> > > "c=US,dc=acme,dc=com"))
> > >
> >
> > Yep this "annotation" of the modified filter AST is done by the
> > optimizer.   And BTW we'll be adding a new index which will give us an exact
> > count for the scope node so we don't have to use MAX_INT anymore.
> >
> > Currently, the scope assertion is just used to eliminate the candidates
> > > which are found by the search engine, we don't 'count' the number of
> > > candidates under a position on the tree (so the MAX_INT value). In our
> > > sample, if we suppose that the scope is containing 100 entries, we
> > > immediatly see that we are not using the correct assertion to bring
> > > back
> > > the entries from the backend.
> >
> >
> > Right this will change.  I will split the current hierarchy index (a
> > misnomer btw since it's really the parent-child index) into a onelevel and
> > sublevel index.
> >
> >  Moreover if we apply the scope to the 'ou' index, we may see that we
> > > have only 10 java coders under the c=US,dc=acme,dc=com branch, instead
> > > of 10000 (not likely, but who knows  ? :)
> > >
> > > So the idea we had was to use the hierarchy index to add some
> > > information about the other index. Here, we may have the number of
> > > java
> > > coder in the c=US,dc=acme,dc=com branch stored under the hierarchy
> > > index
> > > for the key 'c=US,dc=acme,dc=com'. We store all the counst for each
> > > index into this Hierarcy index.
> > > As it will be modified when we modify the other index, this might cost
> > > a
> > > bit on modifications, deletions and additions, but this will have a
> > > huge
> > > impact on the search requests, which is what we target.
> > >
> > > In our example, if we suppose that US has only 10 java coders, the
> > > annotated filter will be :
> > >
> > > (&(10:ou=java)(200:ou=engineering)(500:scope.subtree=
> > > "c=US,dc=acme,dc=com"))
> > >
> > > assuming that we also have 200 engineering in the branch, and that the
> > > scope will bring 500 entries.
> > >
> > > Now, we will just bring back 10 candidates from the backend instead of
> > > 1000.
> > >
> > > This is really a gross example, but I think it gives the idea.
> > >
> > > Is it totally insane, or does it makes some sense ?
> > >
> >
> > Let me rephrase in my own words to see if I understand the idea.  You're
> > suggesting the use of a kind of a scope based LUT or separate index used to
> > return candidates matched by an assertion with a specific attribute and
> > value limited by scope.
> >
> > Effectively we're going to have this (I think - please correct me) with
> > the two separate system indices for hierarchical relationships instead of
> > just having one parent-child index as we do now.  If you look at the
> > documentation here you'll see a description of the onelevel and sublevel
> > indices:
> >
> >
> > http://cwiki.apache.org/confluence/display/DIRxSRVx11/Structure+and+Organization
> >
> > ======== From Doco =======
> >
> > "The one level index maps parent entry identifiers from the master to
> > the identifiers of their children. This index is used to facilitate various
> > name space operations like renaming, and moving branches which impacts
> > children. It is also used by the search engine to conduct one level search
> > requests."
> >
> > This index maps ancestor entry identifiers from the master to the
> > identifiers of all their descendants including immediate children. It does
> > not map the descendants of the context entry (at the root) of the partition.
> > This is just unnecessary since all other entries in the partition satisfy
> > this condition. If this is something we desire to enumerate we can get a
> > reverse Cursor on the ndn index and advance past the first entry, to start
> > enumerating all the descendant identifiers of the context entry.
> >
> > This index plays a critical role in subtree search. It allows the search
> > optimizer to detect a count and annotate the modified search filter for
> > subtree scope nodes. It also allows a Cursor to enumerate the set of
> > descendants associated with an entry. Without this index, the search engine
> > must resort to expensive DN based operations for subtree scope constraint
> > checking."
> > ======== From Doco =======
> >
> > Now with these two indices as opposed to the one we had we can better
> > constrain subtree search.  The subtree search will limit the candidates as
> > will the assertions in the filter as it was given by the user.
> >
> > But I think you mean more and I think I am catching on to it.  You want
> > to constrain individual assertions off an index with scope at each level
> > rather than at just the top perhaps.  I have not thought about this at all
> > so this is a good time to explore it.   Here's what the search engine does
> > to introduce scope into the AST:
> >
> >
> > http://cwiki.apache.org/confluence/display/DIRxSRVx11/SearchEngine+and+Optimizer
> >
> > So what would taking the conjunction ('AND' or intersection) of the
> > scope node with each leaf assertion do to influence the search plan and cost
> > of searching instead of taking the conjunction only at the top?
> >
> > I guess it depends on the kind of logic in the filter.  If we have a
> > disjunction (OR) then applying the scope constraint to all child
> > sub-expressions could seriously reduce the IO/processing of search
> > operations while doing the same work.  Take this example:
> >
> > (| (ou=engineering)[100] (c=US)[1000] (l=Jacksonville)[10])
> >
> > => 1110 worst case for entire filter
> >
> > If under scope constraints these numbers are reduced by a factor of 10:
> >
> > (| (ou=engineering)[10] (c=US)[100] (l=Jacksonville)[1])
> >
> > => 111 worst case for entire filter
> >
>
> More thoughts ...
>
> So if 1000 entries are in scope.  Then the filter above with the first set
> of scan counts would result in a Cursor over the subtree index (presuming
> subtree search scope parameter value).  This is because 1000 < 1110.  This
> is what happens when the scope node is AND'ed up at the top instead of being
> pushed deeper into the AST in multiple places.
>
> If we push the scope node down to constrain each assertion in 3 places as
> well as at the top then the Cursor will still be built on the scope node at
> the top most conjunction.  This is because and'ing the scope node in lower
> layers still results in the same scan counts.  For example
>
> (& (ou=engineering)[100] (scope)[1000] )
>
> still has a total scan count of 100.  So the filter AST with scope
> constraints applied all over will still build the candidate generating
> cursor on the topmost conjunction's scope.  It will require 1000 iterations
> with 1000*6 lookups now.  Without pushing the scope into leaves we get 1000
> iterations with 1000*3 lookups.
>
> So this is not so much a great way to implement your idea.
>
> Alex
>

Mime
View raw message