directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel L├ęcharny <>
Subject Search engine performances...
Date Thu, 11 Aug 2016 11:57:00 GMT
Hi guys,

I have checked the way we process a search as we have poor performances
with search likes : (&(cn=abc*)(objectClass=person)) (see DIRSERVER-2062).

I have found two issues.

1) for substring filters, as we have no way to know directly how many
candidates we will have, we do a blind guess :

    public long greaterThanCount( K key ) throws IOException
        // take a best guess
        return Math.min( count, 10L );

so if the count is greater than 10, we always pick 10, regardless of the
index size. The resulting evaluation mighjt very well be totally wrong.
Let say we have 11 candidates that match the (objectClass=person)
filter, but 10000 candidates that match the (cn=abc*) filter, we will
still use this filter, pulling 10000 entries from the database.

2) for an equality filter, where the attributeType is multi-valued, like
(objectClass=person), we will try to count the candidates. If the values
are stored in an array, or in a sub-btree, we will read the full
array/btree and create a set of candidates. That can be costly when we
have thousands of candidates.

Actually, the search engine first annotate the filter, then process it.
The annotation step will sometime construct a set of candidates, even if
we don't use it, just because we need to compute an estimation of what
would be the best index to use. In the filter I shown before, the
substring filter will simply return a count, not a set of candidates,
while the second filter will construct a set of candidates, when the
count is immediately available. Moerover, we discard this set when it
gets bigger than an arbitrary number (100).

I would rather propose we don't build the candidate set if the number is
above a threshold (100, for instance), and only return the real count.
Later on, anyway, we will build the set of candidate if this index is

For the first problem, I would suggest we pick a better approximation
than a magic number (10) : what about a percentage of the index size ?
(like 10%). If the index contains 10000 entries, then the count will be
1000. We can even start to create the set, based on the filter, and stop
if it gets bigger than a given nulmber (100?). This would provide a
better value that 'guess-timating' this number...

wdyt ?

View raw message