jackrabbit-oak-issues mailing list archives

From "Davide Giannella (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (OAK-7300) Lucene Index: per-column selectivity to improve cost estimation
Date Wed, 09 Jan 2019 17:07:00 GMT

     [ https://issues.apache.org/jira/browse/OAK-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Davide Giannella updated OAK-7300:
    Fix Version/s:     (was: 1.10.0)

> Lucene Index: per-column selectivity to improve cost estimation
> ---------------------------------------------------------------
>                 Key: OAK-7300
>                 URL: https://issues.apache.org/jira/browse/OAK-7300
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>            Priority: Major
>             Fix For: 1.12
> In OAK-6735 we improved cost estimation for Lucene indexes. However, the following
> case still does not work as expected: a very common property is indexed (many nodes
> have that property), and each value of that property is more or less unique. In this
> case, the current cost estimate is the total number of documents that contain that
> property. For the condition "property is not null" this is correct, but for the
> common case "property = x" the estimated cost is far too high.
> A known workaround is to set "costPerEntry" for the given index to a low value, for
> example 0.2. However, this isn't a good solution, as it affects all properties and
> all queries that use that index.
> It would be good to be able to set the selectivity per property, for example by
> specifying the number of distinct values, or (better yet) the average number of
> entries for a given key (1 for unique values; 2 meaning that for each distinct value
> there are two documents on average).
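
The arithmetic behind this proposal can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not part of Oak's API, and the real cost calculation in Oak involves further factors (such as "costPerEntry") that are left out here.

```java
// Illustrative only: these names are hypothetical and are not part of Oak's API.
final class SelectivityCostSketch {

    /** Average number of index entries per distinct value of a property. */
    static double averageEntriesPerKey(long documentsWithProperty, long distinctValues) {
        if (distinctValues <= 0) {
            // Nothing is known about the property: fall back to the worst case.
            return documentsWithProperty;
        }
        return (double) documentsWithProperty / distinctValues;
    }

    /** Estimated number of entries that a condition on the property returns. */
    static double estimatedEntries(long documentsWithProperty,
                                   double averageEntriesPerKey,
                                   boolean equalityCondition) {
        if (!equalityCondition) {
            // "property is not null" matches every document that has the property.
            return documentsWithProperty;
        }
        // "property = x" matches about one key's worth of entries,
        // and never more than the documents that have the property.
        return Math.min(documentsWithProperty, averageEntriesPerKey);
    }

    public static void main(String[] args) {
        long docs = 1_000_000L;
        // Assumed example: 500,000 distinct values, so two documents per value.
        double perKey = averageEntriesPerKey(docs, 500_000L);
        System.out.println("is not null: " + estimatedEntries(docs, perKey, false));
        System.out.println("= x:         " + estimatedEntries(docs, perKey, true));
    }
}
```

With these assumed numbers, the "is not null" estimate stays at the full document count, while the "= x" estimate drops to the average entries per key, which is the gap the issue describes.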
> That value can be set manually (cost override), and it can be set automatically, e.g.
> when building the index, or updated from time to time during the index update, using
> a cardinality estimation algorithm. That doesn't have to be accurate; we could use a
> rough approximation such as HyperBitBit.
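
To make the idea of an approximate distinct-value count concrete, here is a minimal sketch. It does not implement HyperBitBit; it uses linear counting, a simpler and well-known estimator, swapped in purely for illustration. The class name, the bitmap size, and the use of hashCode() with a Murmur3-style finalizer are assumptions of this sketch, not anything taken from Oak.

```java
import java.util.BitSet;

// Illustrative only: linear counting, not HyperBitBit, and not Oak code.
final class LinearCountingSketch {
    private final int size;      // number of bits in the bitmap
    private final BitSet bits;

    LinearCountingSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    /** Murmur3 32-bit finalizer, used to spread hashCode() values over all bits. */
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Records one property value; repeated values set the same bit again. */
    void add(Object value) {
        bits.set(Math.floorMod(mix(value.hashCode()), size));
    }

    /** Estimated number of distinct values: -m * ln(fraction of bits still zero). */
    double estimate() {
        // If every bit is set the sketch is saturated; treat one bit as empty
        // so that the logarithm stays finite.
        int zeroBits = Math.max(1, size - bits.cardinality());
        return -size * Math.log((double) zeroBits / size);
    }

    public static void main(String[] args) {
        LinearCountingSketch sketch = new LinearCountingSketch(1 << 16);
        long documents = 0;
        for (int i = 0; i < 10_000; i++) {
            // Each value is indexed twice, so there are two entries per key.
            sketch.add("value-" + i);
            sketch.add("value-" + i);
            documents += 2;
        }
        double distinct = sketch.estimate();
        System.out.println("estimated distinct values: " + distinct);
        System.out.println("estimated entries per key: " + documents / distinct);
    }
}
```

The bitmap here is sized well above the expected number of distinct values; linear counting loses accuracy as the bitmap fills up, which is one reason a fixed-size algorithm such as the one named in the issue may be a better fit in practice.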

This message was sent by Atlassian JIRA