directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Karasulu <akaras...@apache.org>
Subject Re: [index] reverse index usage for user attributes
Date Mon, 26 Mar 2012 12:32:02 GMT
On Thu, Mar 22, 2012 at 7:20 PM, Emmanuel L├ęcharny <elecharny@gmail.com>wrote:

> Hi,
>
> Currently, we create a forward and a reverse index for each index
> attribute. The forward index is used when we annotate a filter, searching
> for the number of candidate it filters. The reverse index usage is a bit
> more subtile.
>
>
[Hope you don't mind below I just use concise English to express the
process - not trying to insult your description - apologies if it sounds
this way.]

To state this more concisely, the reverse index facilitates rapid Evaluator
evaluations of non-candidate generating assertions in the filter AST. As
you know, one assertion in the filter (scope assertions included) is
selected as the candidate generating assertion which uses a Cursor to
generate candidates matching it's logic. The other filter assertions in the
AST are used to evaluate whether or not the generated candidates match.

NOTE: the optimizer annotates the AST's nodes with scan counts and this is
used to drive selection of the proper candidate generating assertion in the
AST. This is done using a DFS through the AST to find the lowest scan count
containing leaf node (an assertion). This reduces the search set.

Let's consider a filter like (cn=jo*). As we don't have any clue about the
> value (it will start with 'jo', but that's all we know), we can't use the
> 'cn' forward index.


Yep the scan count annotation used for the cn=jo* assertion will be the
total count of entries in the DIB since the BTree cannot give us a count
figure. So depending on what scope is used it will most likely be the scope
assertion that will drive candidate generation since it will most likely
have a smaller scan count.


> We can't use the 'cn' reverse index either, at least directly. So the
> engine will use the scope to determinate a list of candidate : {E1, E2, ...
> En}. Now, instead of fetching all those entries from the master table, what
> we can do is to use the 'cn' reverse table, as we have a list of entry IDs.
>
> For each entry id in {E1, E2, ... En}, we will check if the 'cn' reverse
> table contain some value starting with 'jo'.
>
>
Yep each candidate generated from the scope assertion E1, E2, ... En will
use the ID to look into the reverse BTree and check if the value for that
candidate matches jo*.


> If th 'cn' index contains the two following table :
> forward =
>  john --> {E1, E3}
>  joe  --> {E2, E3, E5}
>  ...
>
> reverse =
>  E1 --> john
>  E2 --> joe
>  E3 --> joe, john
>  E5 --> joe
>  ...
>
>
Yep.


> then using the reverse table, we will find that the E1, E2, E3 and E5
> entry match, when all the other aren't. No need to fetch the E4, ... En
> entries from the master table.
>
> Now, exploiting this rverse table means we read a btree. Which has the
> same cost than reading the Master Table (except that we won't have to
> deserialize the entry).
>
>
IO time agreed.


> What if we don't have a reverse table ?
>
> We will have to deserialize the entries, all of them from the {E1, E2, ...
> En} set filtered by the scope.
>
> Is this better then to build a reverse index ?


The only issue with this is that it will churn the entry cache. Meaning
there's some proximity value in a settled cache due to the kinds of queries
that generally occur. Optimally a cache should contain those entires most
often looked up.

A large master table scan will whip away a settled cache's knowledge. Using
a reverse index instead has more value in this case. It's a delicate
balance.


> Not necessarily. In the case where we have more than one node in the
> filter, for instance (&(cn=jo*)(sn=test)(**objectClass=person)), then
> using the reverse index means we will access as many btrees as we have
> index attributes in the filter node. Here, if cn, sn and objectClass are
> indexed, we will potentially access 4 btrees (the scope will force us to
> consider using the oneLevel index or the subLevel index).
>
> At the end, it might be more costly, compared to using the entry and match
> it against the nodes in the filter.
>
>
Interesting point! The scan counts might help us out on a better
optimization for these kinds of cases.

If the search set is constrained (below some configurable threshold i.e.
10-50 entries) and if the filter uses many indices (above some threshold
i.e. 3-4 indices) then it might be better to just pull from the master
table directly without leveraging indices.

This will optimize for speed without blowing out the cache memory. What do
you think?


> When we have many entries already in the cache, thus sparing the cost of a
> deserialization, then accessing more than one BTree might be costly,
> comparing to using the entries themselves.
>
>
Agreed but again let me stress protecting the cache memory from a large
master table scan. I think we can take this strategy for the cases
mentioned above.


> An alternative
> --------------
>
> The pb is that the Node in the filter use a substring : jo*. Our index are
> built using the full normalized value. That does not fit.
>
>
+1


> What if instead of browsing the underlying tree, we use a specific
> compare() method instead of an equals() method ? As the tree is ordered,
> finding the subset of entries taht will be valid candidate is just a matter
> of finding the part of the tree which match the comparison.
>
> In our example, the compare() method will compare the first caracters from
> the keys to the filter value. Here, the compare method will be the
> String.startWith(), and if it's not true, we will do a compareTo() to know
> if we should go donw th tree on the left or on the right of the current key.
>
>
You know I think we might be doing this. I remember lining up a Cursor on
these boundaries for the sake of evaluation but this code might NOT be
executed because the scan counts are using total DIB count for substring
expressions.


> For that, we just need the forward index, the reverse index is useless.
>
>
I think this works for when you're generating candidates for an assertion
using the substring operator. The reverse lookups will still be needed on
the generated candidates and doing away with this is not a wise idea for
large search domains since it is guaranteed to wipe away the cache memory.


> If we use a substring with a '*' at the beginning, then we need another
> index : a reverted index. Tht means all the values will be stored reverted
> : john will be stored as 'nhoj', so doing a search on (cn=*hn) will be fast.
>
>
Yes we thought about doing this to handle substrings that have ONLY an
'end' component to them but we decided not to deal with it. This is perhaps
the worst one of the most inefficient constructs minus the NOT operator.

I think removing the reverse index (which I would love to do) is something
that can be experimented with but we don't have a diverse test set to show
the true impact across different scenarios. And until we have a way to
understand the impact across different kinds of DIBs I'm afraid our
reasoning for one situation verses another is not good enough to warrant
the change. This is not a simple matter we should come to a decision on
with even an intelligent guess.

-- 
Best Regards,
-- Alex

Mime
View raw message