directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel L├ęcharny <elecha...@gmail.com>
Subject [index] reverse index usage for user attributes
Date Thu, 22 Mar 2012 17:20:11 GMT
Hi,

Currently, we create a forward and a reverse index for each index 
attribute. The forward index is used when we annotate a filter, 
searching for the number of candidate it filters. The reverse index 
usage is a bit more subtile.

Let's consider a filter like (cn=jo*). As we don't have any clue about 
the value (it will start with 'jo', but that's all we know), we can't 
use the 'cn' forward index. We can't use the 'cn' reverse index either, 
at least directly. So the engine will use the scope to determinate a 
list of candidate : {E1, E2, ... En}. Now, instead of fetching all those 
entries from the master table, what we can do is to use the 'cn' reverse 
table, as we have a list of entry IDs.

For each entry id in {E1, E2, ... En}, we will check if the 'cn' reverse 
table contain some value starting with 'jo'.

If th 'cn' index contains the two following table :
forward =
   john --> {E1, E3}
   joe  --> {E2, E3, E5}
   ...

reverse =
   E1 --> john
   E2 --> joe
   E3 --> joe, john
   E5 --> joe
   ...

then using the reverse table, we will find that the E1, E2, E3 and E5 
entry match, when all the other aren't. No need to fetch the E4, ... En 
entries from the master table.

Now, exploiting this rverse table means we read a btree. Which has the 
same cost than reading the Master Table (except that we won't have to 
deserialize the entry).

What if we don't have a reverse table ?

We will have to deserialize the entries, all of them from the {E1, E2, 
... En} set filtered by the scope.

Is this better then to build a reverse index ? Not necessarily. In the 
case where we have more than one node in the filter, for instance 
(&(cn=jo*)(sn=test)(objectClass=person)), then using the reverse index 
means we will access as many btrees as we have index attributes in the 
filter node. Here, if cn, sn and objectClass are indexed, we will 
potentially access 4 btrees (the scope will force us to consider using 
the oneLevel index or the subLevel index).

At the end, it might be more costly, compared to using the entry and 
match it against the nodes in the filter.

When we have many entries already in the cache, thus sparing the cost of 
a deserialization, then accessing more than one BTree might be costly, 
comparing to using the entries themselves.

An alternative
--------------

The pb is that the Node in the filter use a substring : jo*. Our index 
are built using the full normalized value. That does not fit.

What if instead of browsing the underlying tree, we use a specific 
compare() method instead of an equals() method ? As the tree is ordered, 
finding the subset of entries taht will be valid candidate is just a 
matter of finding the part of the tree which match the comparison.

In our example, the compare() method will compare the first caracters 
from the keys to the filter value. Here, the compare method will be the 
String.startWith(), and if it's not true, we will do a compareTo() to 
know if we should go donw th tree on the left or on the right of the 
current key.

For that, we just need the forward index, the reverse index is useless.

If we use a substring with a '*' at the beginning, then we need another 
index : a reverted index. Tht means all the values will be stored 
reverted : john will be stored as 'nhoj', so doing a search on (cn=*hn) 
will be fast.

Thoughts ?

-- 
Regards,
Cordialement,
Emmanuel L├ęcharny
www.iktek.com


Mime
View raw message