directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel Lécharny <elecha...@gmail.com>
Subject Re: Search engine performances...
Date Fri, 12 Aug 2016 03:36:40 GMT
Le 11/08/16 à 22:20, Howard Chu a écrit :
> Emmanuel Lécharny wrote:
>> Le 11/08/16 à 18:05, Alex Karasulu a écrit :
>>> The other substring filter expressions without prefixes or suffixes
>>> would require an inhibitive full scan of the index: i.e. ‘*klm*’.
>>> Pointless to do and would clear cache memory with the churn. So your
>>> 10% of total size thingy if configurable by the administrator makes
>>> sense as a best guess before the optimizer goes to work.
>>
>> There are other options, like the ones implemented by OpenLDAP, OpenDJ,
>> OpenDS : build indexes based on part of the values. For instance, let
>> say we have entry 1 : 'cn=A value', we can imagine indexing every 3
>> consecutive letters for this value. The index will then contain things
>> like :
>>
>> 'A v' -> entry 1
>> ' va' -> entry 1
>> 'val' -> entry 1
>> 'alu' -> entry 1
>> 'lue' -> entry 1
>>
>> Now, searching for '*lue' will brings entry 1 immediately, as searching
>> for 'A v*', as searching for '*val*'
>>
>> That comes with a cost : the index will be huge. Openldap and other
>> allows you to tune this index in many ways. Typically, Openldap has the
>> following configuration parameters to tune the index :
>> *index_substr_if_minlen, **index_substr_if_maxlen,
>> **index_substr_any_len, **index_substr_any_step
>>
>> The XXX_step parameters is by default set to 2, which allows to split
>> the index by a factor 2, but will require an average of 1.5 lookups if
>> the entry is present, or 2 lookups if it's not present. With a bigger
>> step the index will be even smaller (so there is a gain in the index
>> size) but will require more lookups.
>>
>> This also solves the *xyz problem : you don't need an extra revert
>> index.
>>
>> Note that it's not a perfect solution either : if the substring in the
>> filter is bigger than the size of the indexed splitted value, you are
>> more likely to get duplicates (and wrong ones). OTOH, it gives an
>> accurate number of candidate, compared to teh holistic approach we
>> use...
>>
>> The best solution does not exist. The only way to get an accurate count
>> would be to let the user to fine tune the index (or to have a smart
>> indexer, that evolves through the analysis of the search request being
>> done... Not likely to be implemented soon ;-)
>> *
>
> Substring indexing is always a challenge, but I think trigram-based as
> we've implemented it is as near to optimal as exists. It has an
> additional weakness in that you can't use the index for shorter terms.
> E.g. we can't find "ab*" or "a*" with a 3-character substring index.
>
> The other benefit from this approach is the space tradeoff, we know
> our index is a constant factor larger than the original string. With
> string-reverse indexing, your index size is n^2/2...
Hmmm? I would say it's only nx2. The main drowback is that it does not
cover the *xyz* use case. (we don't use hashes)




Mime
View raw message