directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel L├ęcharny <elecha...@apache.org>
Subject Re: Index reverse tables : are they useful ?
Date Fri, 24 Jun 2011 07:47:40 GMT
On 6/24/11 9:40 AM, Alex Karasulu wrote:
> On Thu, Jun 23, 2011 at 7:20 PM, Emmanuel Lecharny<elecharny@gmail.com>  wrote:
>> Hi guys,
>>
>> as I was reviewing the Table interface, and the Index hierarchy, I had a bit
>> of time to try to understand this part of the server I was not comfortable
>> with. And I saw that each index is using two tables : a forward and a
>> reverse table.
>>
>> The forward table is obviously mandatory. It is used to retrieve the list of
>> entry's ID from an AT value. Fine. But what about the reverse table ?
>>
>> Alex told me that it's useful when dealing with such filters :
>> (&(cn=A)(sn=B)).
>>
>> The optimization engine will get the index that provides the smallest set of
>> candidate (say, CN here). We then grab from the CN forward table all the
>> entry's ID associated with the 'A' value. Let's call it the CN-IDs set.
>>
>> Now, if we were to grab all the entries from the MasterTable to compare
>> their SN attribute with the 'B' value, t would be quite costly. We use the
>> SN's reverse table. For each entry's ID, we check if the associated value in
>> the SN's reverse table exists or not. If it exists, we have a valid
>> candidate, otherwise, we simply discard the candidate.
>>
>> As we can see, we never fetch the entry unless we have a valid candidate (of
>> course, if the SN and CN AT are indexed).
>>
>> However, I do think this reverse table is spurious. We can get the same
>> result by using the SN's forward table : for each entry's ID, check in the
>> SVn's forward table if the set of values (SN-IDs) contains the ID we pulled
>> from the CN-IDs set.
>>
>> We can even do an SN-IDs ^ CN-IDs ( '^'<=>  intersection) to get the list of
>> valid candidates. No need to do N fetches in the reverse table (where N is
>> the number of ID in CN-IDs).
>>
>> I think we can remove those reverse table, and I expect that it can speed up
>> the search a bit, but mainly speed up greatly any modification operation
>> (add, delete, modify) and even decrease the disk consumption.
> I think this is possible. It would be interesting to test these two
> different systems using different kinds of data. The number of
> duplicate keys will obviously impact the performance: it will be a
> trade off.
>
> To test an Evaluator (i.e. sn=foo) on a candidate entry say id 37, we
> need to perform a reverse lookup on the sn index. The btree key is 37
> and we check if it contains a tuple with a value of 'foo'. If it does
> then the Evaluator returns true if not it returns false. Now since we
> know we're looking for tuple ('foo', 37) and we only have the forward
> table we can lookup the 'foo' entries, and the values from duplicate
> keys would be sorted. We would just need to check if 37 were in these
> values.
>
> The reverse index has no duplicate keys. The only way to get a
> duplicate key in the reverse index is if the same entry (i.e. 37)
> contained the same value ('foo') for the same (sn) attribute. And this
> we know is not possible. So the lookups against the reverse table will
> be faster.

I was thinking about something a bit different : as soon as you have 
grabbed the list of entry's ID from the first index, looking into the 
other indexes will also return a list of Entry's ID. Checking if those 
IDs are valid candidate can then be done in one shot : do the 
intersection of the two sets (they are ordered, so it's a O(n) 
operation) and just get the matching entries.

Compared to the current processing (ie, accessing the reverse index for 
*each* candidate), this will be way faster, IMO.

Note that it's true for the AND connector, but even for the OR 
connector, we don't have this kind of issue.
>> May be I'm missing something though. I have created a branch
>> (http://svn.apache.org/viewvc/directory/apacheds/branches/apacheds-no-reverse-index/)
>> to play with this idea.
> As long as we're careful and don't just push this into the main branch
> I'm very happy to experiment with this. Let's see really what the
> costs are. Because I can sit around and theorize all day but some
> benchmarks really show us what reality is about.
The main issue so far is with the RDN, oneLevel and Subtree indexes, 
which are using entry's ID, and are representing the tree hierarchy.

It impacts the way we delete entries, too, as we currently use the 
reverse index to quickly retrieve the values to remove from the forward 
index. However, we can just iterate on the entry's attribute to do the 
same thing, invoking the drop( K, Value) method instead of invoking the 
drop( ID ) method.

Very interesting experiment !

I'll commit what I'm coming with today.

-- 
Regards,
Cordialement,
Emmanuel L├ęcharny
www.iktek.com


Mime
View raw message