directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Howard Chu <...@symas.com>
Subject Re: Search engine performances...
Date Thu, 11 Aug 2016 21:42:35 GMT
Alex Karasulu wrote:
>
>> On Aug 11, 2016, at 4:20 PM, Howard Chu <hyc@symas.com
>> <mailto:hyc@symas.com>> wrote:
>>
>> Emmanuel Lécharny wrote:
>>> Le 11/08/16 à 18:05, Alex Karasulu a écrit :
>>>> The other substring filter expressions without prefixes or suffixes would
>>>> require an inhibitive full scan of the index: i.e. ‘*klm*’. Pointless
to
>>>> do and would clear cache memory with the churn. So your 10% of total size
>>>> thingy if configurable by the administrator makes sense as a best guess
>>>> before the optimizer goes to work.
>>>
>>> There are other options, like the ones implemented by OpenLDAP, OpenDJ,
>>> OpenDS : build indexes based on part of the values. For instance, let
>>> say we have entry 1 : 'cn=A value', we can imagine indexing every 3
>>> consecutive letters for this value. The index will then contain things
>>> like :
>>>
>>> 'A v' -> entry 1
>>> ' va' -> entry 1
>>> 'val' -> entry 1
>>> 'alu' -> entry 1
>>> 'lue' -> entry 1
>>>
>>> Now, searching for '*lue' will brings entry 1 immediately, as searching
>>> for 'A v*', as searching for '*val*'
>>>
>>> That comes with a cost : the index will be huge. Openldap and other
>>> allows you to tune this index in many ways. Typically, Openldap has the
>>> following configuration parameters to tune the index :
>>> *index_substr_if_minlen, **index_substr_if_maxlen,
>>> **index_substr_any_len, **index_substr_any_step
>>>
>>> The XXX_step parameters is by default set to 2, which allows to split
>>> the index by a factor 2, but will require an average of 1.5 lookups if
>>> the entry is present, or 2 lookups if it's not present. With a bigger
>>> step the index will be even smaller (so there is a gain in the index
>>> size) but will require more lookups.
>>>
>>> This also solves the *xyz problem : you don't need an extra revert index.
>>>
>>> Note that it's not a perfect solution either : if the substring in the
>>> filter is bigger than the size of the indexed splitted value, you are
>>> more likely to get duplicates (and wrong ones). OTOH, it gives an
>>> accurate number of candidate, compared to teh holistic approach we use...
>>>
>>> The best solution does not exist. The only way to get an accurate count
>>> would be to let the user to fine tune the index (or to have a smart
>>> indexer, that evolves through the analysis of the search request being
>>> done... Not likely to be implemented soon ;-)
>>> *
>>
>> Substring indexing is always a challenge, but I think trigram-based as we've
>> implemented it is as near to optimal as exists. It has an additional
>> weakness in that you can't use the index for shorter terms. E.g. we can't
>> find "ab*" or "a*" with a 3-character substring index.
>>
>> The other benefit from this approach is the space tradeoff, we know our
>> index is a constant factor larger than the original string. With
>> string-reverse indexing, your index size is n^2/2…
>
> Hi Howard !!! Hope all is well with you.

Heya Alex, good to see ya.
>
> You got me curious. What’s the index space consumption order with the trigram
> based approach?

It's basically N/index_substr_any_step * index_substr_any_len. In OpenLDAP the 
actual index is a hash, so it's a constant, and not dependent on the 
configured index_substr_any_len.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Mime
View raw message