directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kiran Ayyagari <kayyag...@apache.org>
Subject Re: Search engine performances...
Date Thu, 11 Aug 2016 14:47:19 GMT
On Thu, Aug 11, 2016 at 5:27 PM, Emmanuel L├ęcharny <elecharny@gmail.com>
wrote:

> Hi guys,
>
>
> I have checked the way we process a search as we have poor performances
> with search likes : (&(cn=abc*)(objectClass=person)) (see DIRSERVER-2062).
>
>
> I have found two issues.
>
> 1) for substring filters, as we have no way to know directly how many
> candidates we will have, we do a blind guess :
>
>     public long greaterThanCount( K key ) throws IOException
>     {
>         // take a best guess
>         return Math.min( count, 10L );
>     }
>
> so if the count is greater than 10, we always pick 10, regardless of the
> index size. The resulting evaluation mighjt very well be totally wrong.
> Let say we have 11 candidates that match the (objectClass=person)
> filter, but 10000 candidates that match the (cn=abc*) filter, we will
> still use this filter, pulling 10000 entries from the database.
>
>
> 2) for an equality filter, where the attributeType is multi-valued, like
> (objectClass=person), we will try to count the candidates. If the values
> are stored in an array, or in a sub-btree, we will read the full
> array/btree and create a set of candidates. That can be costly when we
> have thousands of candidates.
>
>
> Actually, the search engine first annotate the filter, then process it.
> The annotation step will sometime construct a set of candidates, even if
> we don't use it, just because we need to compute an estimation of what
> would be the best index to use. In the filter I shown before, the
> substring filter will simply return a count, not a set of candidates,
> while the second filter will construct a set of candidates, when the
> count is immediately available. Moerover, we discard this set when it
> gets bigger than an arbitrary number (100).
>
> I would rather propose we don't build the candidate set if the number is
> above a threshold (100, for instance), and only return the real count.
> Later on, anyway, we will build the set of candidate if this index is
> selected.
>
>
> For the first problem, I would suggest we pick a better approximation
> than a magic number (10) : what about a percentage of the index size ?
> (like 10%). If the index contains 10000 entries, then the count will be
> 1000. We can even start to create the set, based on the filter, and stop
> if it gets bigger than a given nulmber (100?). This would provide a
> better value that 'guess-timating' this number...
>
>
> wdyt ?
>
> +1 to make this change and see

Kiran

Mime
View raw message