directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Howard Chu <...@symas.com>
Subject Re: Search engine performances...
Date Thu, 11 Aug 2016 20:20:30 GMT
Emmanuel Lécharny wrote:
> Le 11/08/16 à 18:05, Alex Karasulu a écrit :
>> The other substring filter expressions without prefixes or suffixes would require
an inhibitive full scan of the index: i.e. ‘*klm*’. Pointless to do and would clear cache
memory with the churn. So your 10% of total size thingy if configurable by the administrator
makes sense as a best guess before the optimizer goes to work.
>
> There are other options, like the ones implemented by OpenLDAP, OpenDJ,
> OpenDS : build indexes based on part of the values. For instance, let
> say we have entry 1 : 'cn=A value', we can imagine indexing every 3
> consecutive letters for this value. The index will then contain things
> like :
>
> 'A v' -> entry 1
> ' va' -> entry 1
> 'val' -> entry 1
> 'alu' -> entry 1
> 'lue' -> entry 1
>
> Now, searching for '*lue' will brings entry 1 immediately, as searching
> for 'A v*', as searching for '*val*'
>
> That comes with a cost : the index will be huge. Openldap and other
> allows you to tune this index in many ways. Typically, Openldap has the
> following configuration parameters to tune the index :
> *index_substr_if_minlen, **index_substr_if_maxlen,
> **index_substr_any_len, **index_substr_any_step
>
> The XXX_step parameters is by default set to 2, which allows to split
> the index by a factor 2, but will require an average of 1.5 lookups if
> the entry is present, or 2 lookups if it's not present. With a bigger
> step the index will be even smaller (so there is a gain in the index
> size) but will require more lookups.
>
> This also solves the *xyz problem : you don't need an extra revert index.
>
> Note that it's not a perfect solution either : if the substring in the
> filter is bigger than the size of the indexed splitted value, you are
> more likely to get duplicates (and wrong ones). OTOH, it gives an
> accurate number of candidate, compared to teh holistic approach we use...
>
> The best solution does not exist. The only way to get an accurate count
> would be to let the user to fine tune the index (or to have a smart
> indexer, that evolves through the analysis of the search request being
> done... Not likely to be implemented soon ;-)
> *

Substring indexing is always a challenge, but I think trigram-based as we've 
implemented it is as near to optimal as exists. It has an additional weakness 
in that you can't use the index for shorter terms. E.g. we can't find "ab*" or 
"a*" with a 3-character substring index.

The other benefit from this approach is the space tradeoff, we know our index 
is a constant factor larger than the original string. With string-reverse 
indexing, your index size is n^2/2...

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Mime
View raw message