On Thu, Mar 22, 2012 at 7:20 PM, Emmanuel Lécharny <elecharny@gmail.com> wrote:
Hi,

Currently, we create a forward and a reverse index for each indexed attribute. The forward index is used when we annotate a filter, to compute the number of candidates each node would select. The reverse index usage is a bit more subtle.


[Hope you don't mind that below I just use concise English to express the process - I'm not trying to insult your description - apologies if it sounds that way.]

To state this more concisely, the reverse index facilitates rapid Evaluator evaluations of the non-candidate-generating assertions in the filter AST. As you know, one assertion in the filter (scope assertions included) is selected as the candidate-generating assertion, which uses a Cursor to generate candidates matching its logic. The other filter assertions in the AST are used to evaluate whether or not the generated candidates match.

NOTE: the optimizer annotates the AST's nodes with scan counts, and these are used to drive the selection of the proper candidate-generating assertion in the AST. This is done using a DFS through the AST to find the leaf node (an assertion) with the lowest scan count. This reduces the search set.
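A rough sketch of that selection step, with hypothetical node types and names (FilterNode, scanCount() and selectCandidateGenerator() are made up here, not the real optimizer classes, and an OR branch would need extra handling since every disjunct can generate candidates):

    import java.util.List;

    public class CandidateGeneratorSelection {

        interface FilterNode {
            long scanCount();             // annotation set by the optimizer
            List<FilterNode> children();  // empty for leaf assertions
        }

        // DFS for the leaf with the smallest scan count: that assertion becomes
        // the candidate generator, every other assertion only evaluates the
        // candidates it produces.
        static FilterNode selectCandidateGenerator(FilterNode node) {
            if (node.children().isEmpty()) {
                return node;
            }
            FilterNode best = null;
            for (FilterNode child : node.children()) {
                FilterNode leaf = selectCandidateGenerator(child);
                if (best == null || leaf.scanCount() < best.scanCount()) {
                    best = leaf;
                }
            }
            return best;
        }
    }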

Let's consider a filter like (cn=jo*). As we don't have any clue about the value (it will start with 'jo', but that's all we know), we can't use the 'cn' forward index.

Yep, the scan count annotation used for the cn=jo* assertion will be the total count of entries in the DIB, since the BTree cannot give us a count figure for a substring. So, depending on what scope is used, it will most likely be the scope assertion that drives candidate generation, since it will most likely have a smaller scan count.
 
We can't use the 'cn' reverse index either, at least not directly. So the engine will use the scope to determine a list of candidates: {E1, E2, ... En}. Now, instead of fetching all those entries from the master table, what we can do is use the 'cn' reverse table, as we have a list of entry IDs.

For each entry ID in {E1, E2, ... En}, we will check if the 'cn' reverse table contains some value starting with 'jo'.


Yep, each candidate generated from the scope assertion (E1, E2, ... En) will use its ID to look into the reverse BTree and check whether the value for that candidate matches jo*.
 
If the 'cn' index contains the two following tables:
forward =
 john --> {E1, E3}
 joe  --> {E2, E3, E5}
 ...

reverse =
 E1 --> john
 E2 --> joe
 E3 --> joe, john
 E5 --> joe
 ...


Yep.
 
then, using the reverse table, we will find that the E1, E2, E3 and E5 entries match, while all the others don't. No need to fetch the E4, ... En entries from the master table.
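A minimal sketch of that per-candidate check, with plain Java collections standing in for the JDBM BTrees (cnReverse and scopeCandidates are made-up names, just for illustration):

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class ReverseLookupEvaluation {
        public static void main(String[] args) {
            // The 'cn' reverse index from the example: id -> normalized values.
            Map<String, Set<String>> cnReverse = Map.of(
                "E1", Set.of("john"),
                "E2", Set.of("joe"),
                "E3", Set.of("joe", "john"),
                "E5", Set.of("joe"));

            // Candidates produced by the scope assertion.
            List<String> scopeCandidates = List.of("E1", "E2", "E3", "E4", "E5");

            // Evaluate (cn=jo*) per candidate without touching the master table.
            for (String id : scopeCandidates) {
                boolean matches = cnReverse.getOrDefault(id, Set.of())
                                           .stream()
                                           .anyMatch(v -> v.startsWith("jo"));
                System.out.println(id + " matches: " + matches);  // E4 is the only false
            }
        }
    }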

Now, exploiting this reverse table means we read a BTree, which has the same cost as reading the master table (except that we won't have to deserialize the entry).


Agreed on the I/O time.
 
What if we don't have a reverse table?

We will have to fetch and deserialize the entries, all of them, from the {E1, E2, ... En} set selected by the scope.
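Roughly, the path without a reverse index would be (again a sketch, with a plain Map standing in for the master table and a made-up attribute layout):

    import java.util.Map;
    import java.util.Set;

    public class MasterTableEvaluation {

        // Without a reverse index, every candidate id is resolved through the
        // master table: fetch, deserialize, then match the filter in memory.
        // Here a plain Map stands in for the master table; in the real store
        // the lookup also costs a deserialization and an entry-cache slot.
        static boolean matches(Map<String, Map<String, Set<String>>> masterTable, String id) {
            Map<String, Set<String>> entry = masterTable.get(id);   // fetch + deserialize
            return entry != null
                && entry.getOrDefault("cn", Set.of())
                        .stream()
                        .anyMatch(v -> v.startsWith("jo"));
        }
    }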

Is this better than building a reverse index?

The only issue with this is that it will churn the entry cache. Meaning there's some proximity value in a settled cache, due to the kinds of queries that generally occur. Optimally, a cache should contain the entries most often looked up.

A large master table scan will wipe away a settled cache's knowledge. Using a reverse index instead has more value in this case. It's a delicate balance.
 
Not necessarily. In the case where we have more than one node in the filter, for instance (&(cn=jo*)(sn=test)(objectClass=person)), using the reverse index means we will access as many BTrees as we have indexed attributes in the filter. Here, if cn, sn and objectClass are indexed, we will potentially access 4 BTrees (the scope will force us to consider using the oneLevel index or the subLevel index).

In the end, it might be more costly than fetching the entry and matching it against the nodes in the filter.


Interesting point! The scan counts might help us come up with a better optimization for these kinds of cases.

If the search set is constrained (below some configurable threshold, e.g. 10-50 entries) and the filter uses many indices (above some threshold, e.g. 3-4 indices), then it might be better to just pull from the master table directly without leveraging the indices.
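Something like the following decision rule, with both thresholds made configurable (the names and numbers below are only placeholders):

    public class DirectMasterTableHeuristic {

        // Placeholder thresholds; both would be made configurable.
        static final int SMALL_SEARCH_SET = 50;   // max candidates
        static final int MANY_INDICES = 3;        // min indexed assertions in the filter

        // True when it is likely cheaper to fetch the few candidates from the
        // master table and evaluate the whole filter in memory, rather than
        // doing one reverse-index lookup per indexed assertion.
        static boolean useMasterTableDirectly(int candidateCount, int indexedAssertions) {
            return candidateCount <= SMALL_SEARCH_SET
                && indexedAssertions >= MANY_INDICES;
        }
    }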

This will optimize for speed without blowing out the cache memory. What do you think?
 
When we have many entries already in the cache, thus sparing the cost of deserialization, accessing more than one BTree might be costly compared to using the entries themselves.


Agreed, but again let me stress protecting the cache memory from a large master table scan. I think we can take this strategy for the cases mentioned above.
 
An alternative
--------------

The problem is that the node in the filter uses a substring: jo*. Our indexes are built using the full normalized value. That does not fit.


+1
 
What if, instead of browsing the underlying tree, we use a specific compare() method instead of an equals() method? As the tree is ordered, finding the subset of entries that will be valid candidates is just a matter of finding the part of the tree which matches the comparison.

In our example, the compare() method will compare the first characters of the keys with the filter value. Here, the compare method will be String.startsWith(), and if it returns false, we will do a compareTo() to know whether we should go down the tree to the left or to the right of the current key.
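On an ordered map this can be sketched with the JDK's NavigableMap: position on the first key >= the prefix, then walk forward while startsWith() holds (the real index is a JDBM BTree, but the ordering argument is the same):

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.TreeMap;

    public class PrefixScan {
        public static void main(String[] args) {
            // Ordered forward index for 'cn' (a TreeMap standing in for the BTree).
            NavigableMap<String, Set<String>> cnForward = new TreeMap<>();
            cnForward.put("joe", Set.of("E2", "E3", "E5"));
            cnForward.put("john", Set.of("E1", "E3"));
            cnForward.put("mary", Set.of("E4"));

            // Position on the first key >= "jo" and walk forward while the key
            // still starts with "jo": because the tree is ordered, that is the
            // whole contiguous range matching (cn=jo*).
            for (Map.Entry<String, Set<String>> e : cnForward.tailMap("jo", true).entrySet()) {
                if (!e.getKey().startsWith("jo")) {
                    break;   // we walked past the matching range
                }
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }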


You know, I think we might already be doing this. I remember lining up a Cursor on these boundaries for the sake of evaluation, but this code might NOT be executed because the scan counts use the total DIB count for substring expressions.
 
For that, we just need the forward index; the reverse index is useless.


I think this works when you're generating candidates for an assertion using the substring operator. The reverse lookups will still be needed on the generated candidates, and doing away with them is not a wise idea for large search domains, since it is guaranteed to wipe out the cache memory.
 
If we use a substring with a '*' at the beginning, then we need another index: a reverted index. That means all the values will be stored reversed: 'john' will be stored as 'nhoj', so doing a search on (cn=*hn) will be fast.
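A sketch of such a reverted index, reusing the same prefix walk on the reversed values (purely hypothetical, nothing like this exists in the current code):

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.TreeMap;

    public class RevertedIndexScan {
        public static void main(String[] args) {
            // Values are stored reversed: 'john' -> 'nhoj', 'joe' -> 'eoj'.
            NavigableMap<String, Set<String>> cnReverted = new TreeMap<>();
            cnReverted.put(reverse("john"), Set.of("E1", "E3"));
            cnReverted.put(reverse("joe"), Set.of("E2", "E3", "E5"));

            // (cn=*hn) becomes a prefix scan for reverse("hn") = "nh".
            String prefix = reverse("hn");
            for (Map.Entry<String, Set<String>> e : cnReverted.tailMap(prefix, true).entrySet()) {
                if (!e.getKey().startsWith(prefix)) {
                    break;
                }
                System.out.println(reverse(e.getKey()) + " -> " + e.getValue());  // john
            }
        }

        static String reverse(String s) {
            return new StringBuilder(s).reverse().toString();
        }
    }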


Yes, we thought about doing this to handle substrings that have ONLY an 'end' component, but we decided not to deal with it. This is perhaps the most inefficient construct, apart from the NOT operator.

I think removing the reverse index (which I would love to do) is something that can be experimented with, but we don't have a diverse test set to show the true impact across different scenarios. And until we have a way to understand the impact across different kinds of DIBs, I'm afraid our reasoning for one situation versus another is not good enough to warrant the change. This is not a simple matter we should come to a decision on with even an intelligent guess.

--
Best Regards,
-- Alex