directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Karasulu <>
Subject Re: Search engine performances...
Date Thu, 11 Aug 2016 19:10:46 GMT

> On Aug 11, 2016, at 2:02 PM, Emmanuel Lécharny <> wrote:
> Le 11/08/16 à 18:05, Alex Karasulu a écrit :
>> Hi Em,
>> The substring filter is a big PITA for sure. As you know the substring filters with
a fixed prefix like the one you show here with ‘abc’ prefix is the best kind we can get
for implementing a realistic scan count: much better than a ‘*xyz’. Maybe we can use tricks
on the index if it exists for these classes of substring expressions. Like for example advancing
to the first and last ‘abc’ prefix occurrence to figure out an accurate scan count.
>> An approach that could be taken with the class of substring expressions with suffix
terms (i.e. ‘*xyz’ ) is to use reverse string indices in addition to the current forward
string index. This comes with a cost though of building and maintaining the index. However
it would speed up most classes of substring expressions.  
>> The other substring filter expressions without prefixes or suffixes would require
an inhibitive full scan of the index: i.e. ‘*klm*’. Pointless to do and would clear cache
memory with the churn. So your 10% of total size thingy if configurable by the administrator
makes sense as a best guess before the optimizer goes to work.
> There are other options, like the ones implemented by OpenLDAP, OpenDJ,
> OpenDS : build indexes based on part of the values. For instance, let
> say we have entry 1 : 'cn=A value', we can imagine indexing every 3
> consecutive letters for this value. The index will then contain things
> like :
> 'A v' -> entry 1
> ' va' -> entry 1
> 'val' -> entry 1
> 'alu' -> entry 1
> 'lue' -> entry 1
> Now, searching for '*lue' will brings entry 1 immediately, as searching
> for 'A v*', as searching for '*val*'
> That comes with a cost : the index will be huge.

Yeah I had seen these “create every substring permutation indices” in other LDAP servers.
I just turned around, said "Oye vey," and walked the other away. 

> Openldap and other
> allows you to tune this index in many ways. Typically, Openldap has the
> following configuration parameters to tune the index :
> *index_substr_if_minlen, **index_substr_if_maxlen,
> **index_substr_any_len, **index_substr_any_step

Yup lots of management overhead, too many one offs, and lots of index bloat potential IMHO.
Plus this can be a long running task drawing out the life of a transactional update if we
have to generate each permutation on substring index updates.

> The XXX_step parameters is by default set to 2, which allows to split
> the index by a factor 2, but will require an average of 1.5 lookups if
> the entry is present, or 2 lookups if it's not present. With a bigger
> step the index will be even smaller (so there is a gain in the index
> size) but will require more lookups.
> This also solves the *xyz problem : you don't need an extra revert index.

Yeah it does but it also increases complexity. However if folks want to deal with that my
hats off to them. 

The reverse index only seems simple and easy. It deals with another major class of substring
expressions having suffixes. I have a feeling most substring expressions are going to have
prefixes and/or suffixes. Even if middle terms are present scan counts on suffix or prefix
terms are enough to get a correct scan count. If both are present the prefix and suffix are
like AND'd expressions where each term shrinks the search space further.

We can throw our hands up in the air when users decide to not include a prefix or suffix:
i.e. STAR MIDDLE_TERMS STAR. The server can just take an educated guess. This seems like the
most sane tradeoff.

> Note that it's not a perfect solution either : if the substring in the
> filter is bigger than the size of the indexed splitted value, you are
> more likely to get duplicates (and wrong ones). OTOH, it gives an
> accurate number of candidate, compared to teh holistic approach we use...
> The best solution does not exist. The only way to get an accurate count
> would be to let the user to fine tune the index (or to have a smart
> indexer, that evolves through the analysis of the search request being
> done... Not likely to be implemented soon ;-)
> *



View raw message