From Mon Mar 26 15:18:27 2012 Return-Path: X-Original-To: Delivered-To: Received: from ( []) by (Postfix) with SMTP id 97D339E06 for ; Mon, 26 Mar 2012 15:18:27 +0000 (UTC) Received: (qmail 62462 invoked by uid 500); 26 Mar 2012 15:18:27 -0000 Delivered-To: Received: (qmail 62427 invoked by uid 500); 26 Mar 2012 15:18:27 -0000 Mailing-List: contact; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: Delivered-To: mailing list Received: (qmail 62419 invoked by uid 99); 26 Mar 2012 15:18:27 -0000 Received: from (HELO ( by (qpsmtpd/0.29) with ESMTP; Mon, 26 Mar 2012 15:18:27 +0000 X-ASF-Spam-Status: No, hits=-1994.3 required=5.0 tests=ALL_TRUSTED,HTML_MESSAGE,MIME_HTML_ONLY X-Spam-Check-By: Received: from [] (HELO ( by (qpsmtpd/0.29) with ESMTP; Mon, 26 Mar 2012 15:18:21 +0000 Received: from thor (localhost []) by (8.13.8+Sun/8.13.8) with ESMTP id q2QFI0V5018491 for ; Mon, 26 Mar 2012 15:18:00 GMT Date: Mon, 26 Mar 2012 11:18:00 -0400 (EDT) From: To: Message-ID: <12032371.52985.1332775080026.JavaMail.confluence@thor> Subject: [CONF] Apache Directory Server v1.5 > SearchEngine and Optimizer MIME-Version: 1.0 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Auto-Submitted: auto-generated

SearchEngine and Optimizer

Page edited by Emmanuel L=C3=A9charny

Changes (1)

=20 =20
=20 =20
* the attribute to return for= entries

{warning:title=3DNormalizati= on of Attributes to Return}
Presently the SearchOperationContext contai= ns a JNDI SearchControls instance to encapsulate some of these parameters. = The attribute identifiers used in the SearchControls for the attributes to= return may or may not be normalized to the attributeType OID. These shoul= d be normalized to provide either a set of AttributeType OID's or a set= of AttributeType objects themselves.


Full Content

Introduct= ion

The SearchEngine conducts search operations on a Partition using this in= dexed master table scheme. It requires the presence of the system indices = mentioned to evaluate LDAP filters on the part of the DIT contained in such= a Partition. SearchEngine is an interface with a default implementation w= hich would even if we implemented such a partition using some other BTree i= mplementation besides JDBM. We could use a custom in memory BTree implemen= tation like an AvlTree as the basis for Tables for a new kind of Partition,= and the DefaultSearch engine will be able to conduct search operations on = it. We could even use SleepyCat's BDB JE for another disk based BTree impl= ementation for Tables, to build a new kind of Partition based on this schem= e while using the DefaultSearchEngine to search it.

Optimizer and it's default implementation, the DefaultOptimizer, annotat= es search filter AST's to provide optimization hints to the SearchEngine. = Like the SearchEngine the optimizer implementation can operate on any Parti= tion implementation using this indexed master table scheme containing the s= ame required system indices.

In this document we define the search algorithm and optimization techniq= ues applied by the DefaultSearchEngine and the DefaultOptimizer.

Search Parame= ters

ApacheDS provides all partitions with parameters to a search operation u= sing a SearchOperationContext object. This object contains:

  • the search scope
  • =09
  • the search base dn
  • =09
  • the alias dereferencing mode
  • =09
  • the search filter as an abstract syntax tree
  • =09
  • the search time and size limits
  • =09
  • the attribute to return for entries

Search Filter = AST

Most of the search parameters are self explanatory with the exception of= the search filter abstract syntax tree. For each search request ApacheDS = produces an AST representation of the search filter. The branch nodes of t= his filter tree represent the boolean filter operators AND (&= ), OR (|), NOT (!). The leaf nodes of = this filter represent atomic assertions in the filter. See the example fil= ter and the AST representation for it:

*(& (| (organizationalUnitName=3DSales) (OU=3DBOARD    of director=
s) )=20
    (! (localityName=3DsUnnYVale) )=20
    ( )

After generating the AST for this filter, the server will normalize asse= rtion nodes (the leaf nodes) and pass the resulting AST to the PartitionNex= us. The PartitionNexus forwards this to the partition containing the searc= h base. Here's the normalized version of this filter and it's AST:

*(& (| ( ( of directors) )=20
    (! ( )=20
    ( )

Scope Prepara= tion

Upon receiving the normalized search filter AST, the default search engi= ne performs some basic scope analysis. The results of this analysis produc= e a special tree node called a ScopeNode. The scope node is injected into = the filter AST rearranging it. A new AND node is used as the root. = The old AST is inserted into the new root as a child. The scope node is a= lso inserted as a child. To the left is the resultant modified filter AST = using our example filter.

The ScopeNode embeds an additional scope constraint assertion into the f= ilter. It enforces the scope constraint while conducting search operations= . The partition adds this node now before invoking the optimizer so scope = constraints can be considered for optimization. The optimizer is aware and= knows how to handle this special ScopeNode containing all the need scope i= nformation or this search operation.

Alias Deref= erencing

The ScopeNode does not represent the simple scope parameter for one, sin= gle and subtree. It represents the overall scope of filter applicability: = the scope constrained candidate set a filter should be applied to. This fa= ctors in scope expansion due to alias dereferencing. Each alias encountere= d while dereferencing in any of the modes is added as a unique base to sear= ch. This represents the complete scope.

As mentioned before in the Structure and= Organization page, a set of aliases are found which refer to entries e= xpanding the scope. These aliases are dereferenced to list all the unique = roots that must now be searched to correctly evaluate the filter across thi= s expanded scope.

Op= timization with Annotations

The next step in the process of preparing a filter AST is to annotate it= with cost annotations. This is the task of optimizer. The default optimi= zer uses scan counts on indices to determine costs. Each node in the AST i= nlcuding branch nodes, and the scope node are has a count attribute associa= ted with it. The scan count approach used by the default optimizer approxi= mates a best guess on the worst case scenario. The count is a guess on how= big the result set could be for entries satisfying the constraints in the = filter subtree.

<= /tr>
3D""The count attribute should be renamed to= cost since different optimizer implementations may use different tactics b= esides simple scan counts to optimize search. The default optimizer just u= ses index scan counts. This may change or be enhanced in the future.

The optimizer performs a depth first traversal of the AST annotating the= children of branch nodes, before trying to annotate the branch node itself= . When simple leaf nodes are entered, the optimizer checks to see if an in= dex exists for the assertion attribute of the leaf node. If an index is av= ailable, it is used to get a quick count of the number of entries in the pa= rtition that would satisfy the asserted condition in the leaf node. This f= igure is set as the count annotation on the leaf node. A technique exists = for quickly looking up scan counts for assertion operators (>, <, =3D= , ~=3D, =3D*, approx, substr, scope).

Substring Assertions Are Not Cheap
ApacheDS does not use a specialized index for substring operator evaluati= on. Instead it walks an equality/ordering index with regular expression eva= luators. This is costly since it requires pattern matching on anything but= the most simplistic cases. Anything that can limit the region of the inde= x scanned is worth using. For example if ou is indexed and the following f= ilter is used, (ou=3DEng*), then the index can natively limit our scan to o= nly those entries starting with 'Eng' without having to use any regular exp= ression evaluations. If instead the following filter is used, (ou=3DEng*Ma= nagers), then we still limit the range natively to those index entries wher= e ou starts with Eng, however a regular expression must be used to make sur= e the value ends with 'Managers'. One would think the worst case is (ou=3D= *) but this is not the substring operator, it is the existence operator whi= ch can quickly find the scan count which is just the size of the index: no = regular expression is needed. The worse of the worst cases is a substring = search expression without an initial anchor (no init component) like (ou=3D= *Managers). Such a substring filter assertion will result in a full index = scan while using a regular expression on each index entry to assert whether= or not the ou value ends in 'Managers'. The optimizer avoid this and take= s an educated guess, otherwise it would spend too much time trying to optim= ize making it's function worthless.

Once all the children of a branch node have been annotated, the branch n= ode must be annotated using the annotation results from it's children.

AND nodes are annotated with the lowest scan count of it's childr= en. This is the absolute worst case, or maximum result set size, that coul= d be accepted by the logic in this AND branch of the filter. This m= akes sense since logical AND operations compute the intersection. T= he most that could intersect is the smallest set.

OR nodes are annotated with the sum of the it's children's scan c= ounts. This is the absolute worst case, or maximum result set size. Again= this makes sense since the logical OR operation computes the union = of the sets. The most that could be returned is the sum of all possible en= tries that could be returned by OR node children.

NOT nodes contain a single child. With the present indexing sche= me the best we can do is subtract the scan count of the child, from the siz= e of the complete set of entries. This works as the absolute worst case an= d makes sense since logical NOT operations invert the selection.

3D""Perhaps not here but we need to make a n= ote about how filters with NOT operations could potentially result in full = index scans which are extremely costly in a very large directory (VLD). So= me possible tactics may remove these problems at a cost. For example we th= ought about adding an inverse LUT to indices just for this purpose but that= was not feasible: how do you generate all the combinations for what values= an entry attribute does not have?

This process is continued until the root of the AST is reached. At this= point we have a rough estimate of the worst case for the entire filter.

Search Algorithm, and Cursor Construction

The call to search() on the search engine does not synchronously = search the partition to construct a result set to return. Doing so would r= equire potentially massive buffers to store the results and increase respon= se latency. An efficient and fast server must minimize the memory footprin= t of each active search request so it can process as many concurrent reques= ts as possible.

Instead of computing the results, the search engine constructs a Cursor = that can produce the resultant entries matching the filter and scope constr= aints. This Cursor may be a composite cursor potentially wrapping other Cu= rsors depending on the filter. Once the Cursor is constructed and returned= to be used by higher levels, it can traverse the results one by one withou= t having to buffer the entire result set matching the filter. ApacheDS' LD= AP front end can then advance forward or backwards, or even to the first or= last result using this Cursor. Usually the front end will pull a candidat= e entry and send it out the door before getting the next entry to prevent h= aving to buffer more than one entry at a time. Calls on the Cursor interfa= ce to position it leverage partition indices and filter assertion logic to = compute the next or previous candidate. The Cursor is a critical pattern i= n the efficient operation of the search algorithm.

Assertion Evaluators and Candidate Cursors

There are many ways a Cursor can be built to traverse candidates that sa= tisfying a filter. Some ways are less efficient than others, in that they = will result in more computation and require more IO to evaluate the same fi= lter with each positioning and advance. The search engine uses the scan co= unt annotations on nodes to govern how it builds an optimal Cursor. It use= s a recursive Cursor building visitor on the nodes of the AST. The visitor= takes actions to create different Evaluators and Cursors based on the node= type. Cursors generate candidates as IndexEntry objects that satisfying f= ilter expressions. Evaluators evaluate expressions on candidates generated= from Cursors returning IndexEntry objects.

At branch nodes, the visitor makes a decision how to build candidate gen= erating Cursors and Evaluators.

At NOT branch nodes, the visitor creates a Cursor on the NDN inde= x. Since NOT branch node types only possess a single child, that ch= ild is used to create a single Evaluator. The Evaluator and Cursor pair ar= e used to build a NotCursor. On advances and positioning attempts, the Not= Cursor uses the NDN index Cursor to generate candidates, then checks to mak= e sure the Evaluator returns false for that candidate. NotCursors incur a = full table scan on the NDN index since they must make sure the child filter= does not match a candidate to accept the candidate. There is no way to bu= ild a Cursor on the child expression and figure out the negation.

At AND branch nodes, the visitor selects a child with the smalles= t scan count and uses that child AST node to build a candidate generating C= ursor. The siblings of the selected child are used to build a list of Eval= uators. The Cursor and list of Evaluators are used to construct an AndCurs= or. On advances and positioning operations the Cursor finds candidates tha= t satisfy the AND node expression by using the child Cursor to gener= ate candidates while using the list of Evaluators to verify a match. All s= ibling Evaluators must match the candidate for the AndCursor to consider a = candidate as having matched the AND node's filter expression.

At OR branch nodes, the visitor creates a list of Cursors for eac= h child node. These Cursors are combined to build an OrCursor. The OrCurs= or walks these child Cursors one at a time. Since an OR operation t= akes the union of all candidates, each child Cursor's candidate is accepted= by the OrCursor.

Both Enumerator building and Cursor building operations are recursive. = Both stop recursing when they reach leaf nodes which either produce a simpl= e leaf expression Evaluator or a Cursor off of a partition index associated= with the leaf node's assertion attribute.

Comments on S= earch

Annotations only help optimize search when search expressions contain AND expressions. As noted above the scope node addition automatically= makes every AST fed into the optimizer a conjunction. Plus the scope node= can efficiently produce scan count information thanks to the onelevel and = sublevel indices. So whatever the filter, leveraging the optimizer will he= lp.

We mentioned that some Cursors may not be optimal because they could gen= erate more IO and require more calculations to determine candidacy on Curso= r positioning and advances. To understand this let's consider the Cursors w= e could build on an example filter expression:

(& (ou=3Dengineering) (l=3DSunnyvale))

For simplicity we will not consider scope in this example at first. Als= o for now consider there to be an index on the ou and l attributes. Accord= ing to this filter expression we can build two kinds of Cursors.

The first Cursor would generate candidates from the ou index using a Cur= sor for ou=3Dengineering leaf node. Each candidate returned represents an = entry in the partition who's ou attribute equals the value 'engineering'. = The Cursor would use the localityName (l) index to perform lookups for each= candidate returned from the ou=3Dengineering Cursor. Lookups will check t= o make sure an IndexEntry exists with id equal to the candidate and value e= qual to 'sunnyvale'.

The second Cursor would generate candidates from the localityName index = for l=3Dsunnyvale leaf node. Each candidate returned represents an entry i= n the partition who's localityName equals the value 'sunnyvale'. The Curso= r would use the ou index to perform lookups for each candidate returned fro= m the l=3Dsunnyvale Cursor. Lookups will check to make sure an IndexEntry = exists with id equal to the candidate and a value equal to 'engineering'

If there are 100 people in the engineering department yet 1000 people in= the company are located in Sunnyvale then the first Cursor will perform 10= 0 iterations as it walks the ou index over Tuples with a key of 'engineerin= g'. For each candidate (of which there are 100), this Cursor will perform = a single lookup on the localityName index. To satisfy this search a total = of 100 Cursor advances occur with 100 lookup operations. In the case of th= e second Cursor, 1000 iterations are performed as it walks the localityName= index over Tuples with a key of 'sunnyvale'. For each candidate (of which= there are 1000), this second Cursor will perform a single lookup on the ou= index. To satisfy this search a total of 1000 Cursor advances occur with = 1000 lookup operations.

Choosing to build the second less efficient Cursor rather than the first= Cursor costs 10 times as much in IO and processing. This is why the use o= f scan counts set by the optimizer are used to select the child with the sm= allest scan count in AND nodes. And as stated before, the addition = of the scope node, always includes an AND node within the AST. Henc= e the optimizer should pay off in most search operations. This is why it i= s enabled by default in the JdbmPartition.

Composite Ind= ices

Talk about future work regarding indices on common search filters. Thes= e indices represent filter matches.