directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Karasulu <akaras...@gmail.com>
Subject Re: Search engine performances...
Date Thu, 11 Aug 2016 16:05:48 GMT
Hi Em,

The substring filter is a big PITA for sure. As you know the substring filters with a fixed
prefix like the one you show here with ‘abc’ prefix is the best kind we can get for implementing
a realistic scan count: much better than a ‘*xyz’. Maybe we can use tricks on the index
if it exists for these classes of substring expressions. Like for example advancing to the
first and last ‘abc’ prefix occurrence to figure out an accurate scan count.

An approach that could be taken with the class of substring expressions with suffix terms
(i.e. ‘*xyz’ ) is to use reverse string indices in addition to the current forward string
index. This comes with a cost though of building and maintaining the index. However it would
speed up most classes of substring expressions.  

The other substring filter expressions without prefixes or suffixes would require an inhibitive
full scan of the index: i.e. ‘*klm*’. Pointless to do and would clear cache memory with
the churn. So your 10% of total size thingy if configurable by the administrator makes sense
as a best guess before the optimizer goes to work.

Best,
Alex

> On Aug 11, 2016, at 7:57 AM, Emmanuel Lécharny <elecharny@gmail.com> wrote:
> 
> Hi guys,
> 
> 
> I have checked the way we process a search as we have poor performances
> with search likes : (&(cn=abc*)(objectClass=person)) (see DIRSERVER-2062).
> 
> 
> I have found two issues.
> 
> 1) for substring filters, as we have no way to know directly how many
> candidates we will have, we do a blind guess :
> 
>    public long greaterThanCount( K key ) throws IOException
>    {
>        // take a best guess
>        return Math.min( count, 10L );
>    }
> 
> so if the count is greater than 10, we always pick 10, regardless of the
> index size. The resulting evaluation mighjt very well be totally wrong.
> Let say we have 11 candidates that match the (objectClass=person)
> filter, but 10000 candidates that match the (cn=abc*) filter, we will
> still use this filter, pulling 10000 entries from the database.
> 
> 
> 2) for an equality filter, where the attributeType is multi-valued, like
> (objectClass=person), we will try to count the candidates. If the values
> are stored in an array, or in a sub-btree, we will read the full
> array/btree and create a set of candidates. That can be costly when we
> have thousands of candidates.
> 
> 
> Actually, the search engine first annotate the filter, then process it.
> The annotation step will sometime construct a set of candidates, even if
> we don't use it, just because we need to compute an estimation of what
> would be the best index to use. In the filter I shown before, the
> substring filter will simply return a count, not a set of candidates,
> while the second filter will construct a set of candidates, when the
> count is immediately available. Moerover, we discard this set when it
> gets bigger than an arbitrary number (100).
> 
> I would rather propose we don't build the candidate set if the number is
> above a threshold (100, for instance), and only return the real count.
> Later on, anyway, we will build the set of candidate if this index is
> selected.
> 
> 
> For the first problem, I would suggest we pick a better approximation
> than a magic number (10) : what about a percentage of the index size ?
> (like 10%). If the index contains 10000 entries, then the count will be
> 1000. We can even start to create the set, based on the filter, and stop
> if it gets bigger than a given nulmber (100?). This would provide a
> better value that 'guess-timating' this number...
> 
> 
> wdyt ?
> 


Mime
View raw message