jackrabbit-oak-issues mailing list archives

From "Vikas Saurabh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (OAK-7929) Incorrect Facet Count With Large Dataset and ACLs
Date Mon, 03 Dec 2018 23:17:00 GMT

    [ https://issues.apache.org/jira/browse/OAK-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706465#comment-16706465
] 

Vikas Saurabh edited comment on OAK-7929 at 12/3/18 11:16 PM:
--------------------------------------------------------------

Checking ACLs is fairly expensive, and an accurate count would require an ACL check
over the whole result set. So, we'd expand the current boolean form of {{secure}} to an enum:
{{insecure}}, {{statistical}} and {{secure}}.

The {{insecure}} mode won't do any ACL check and would return facets exactly as returned by the
index. The {{secure}} mode would generalize the current behavior of checking only the first 50
documents to checking the whole result set instead.
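
A minimal sketch of how the three modes differ in ACL-check cost. The class and method names ({{FacetModes}}, {{parse}}, {{aclChecks}}) are hypothetical and not taken from the patch, and falling back to {{secure}} for missing or unknown values is an assumption:

```java
import java.util.Locale;

class FacetModes {
    /** The three proposed values of the secure setting. */
    enum Mode { INSECURE, STATISTICAL, SECURE }

    /** Parses a configured string; missing or unknown values fall back to SECURE (assumption). */
    static Mode parse(String value) {
        if (value == null) {
            return Mode.SECURE;
        }
        try {
            return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Mode.SECURE;
        }
    }

    /** Number of documents that would need an ACL check in each mode. */
    static long aclChecks(Mode mode, long resultSize, int sampleSize) {
        switch (mode) {
            case INSECURE:
                return 0;                                  // facets returned as-is from the index
            case STATISTICAL:
                return Math.min(resultSize, sampleSize);   // only the sampled documents
            default:
                return resultSize;                         // SECURE: the whole result set
        }
    }
}
```

For a result set of 100000 documents and a sample size of 1000, this gives 0, 1000 and 100000 checks for the three modes respectively.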

The {{statistical}} mode would randomly sample some documents from the result set. It'd compute the
ratio of accessible samples and scale the facet counts by that ratio. A few implementation
details below:
* the default mode would remain {{secure}} (for backward compatibility).
* when {{statistical}} mode is selected, the default value of {{sampleSize}} is 1000
* both the defaults (mode and sampleSize) can be overridden system-wide using the JVM params {{oak.facets.secure}}
and {{oak.facet.statistical.sampleSize}}
* one can also set {{secure}} (String) and {{sampleSize}} (long, cast to int) under {{<definition>/facets}}
to override these per index definition
* the sampling is done using the idea presented in https://dl.acm.org/citation.cfm?id=368159
* 1000 was picked as the default sample size because the expected error rate in sampled data is
given by {{sampleSize ^ -0.5}} \[0]. For 1000, this comes out to roughly 3%.
* for the random number seed, we'd insert a random long number {{seed}} under the index definition
during an indexing cycle. This keeps results consistent across refreshes when there is no
other change in indexed data. From a randomness pov this should still be ok, as the actual generated
random numbers depend on result size, which in turn depends on the search query and indexed
data. From a security pov, the seed should be ok as index defs are administrative data.
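
The override order described in the list (per-index value first, then JVM param, then built-in default) could be sketched roughly as below. The class {{FacetConfig}} and its methods are hypothetical; only the two JVM param names and the defaults come from the list above, and the precedence of per-index over JVM param is my reading of "override these per index definition":

```java
class FacetConfig {
    static final String MODE_PARAM = "oak.facets.secure";
    static final String SAMPLE_SIZE_PARAM = "oak.facet.statistical.sampleSize";

    /** Per-index value wins; otherwise the JVM param; otherwise "secure". */
    static String resolveMode(String perIndexValue) {
        if (perIndexValue != null) {
            return perIndexValue;
        }
        return System.getProperty(MODE_PARAM, "secure");
    }

    /** Per-index value wins; otherwise the JVM param; otherwise 1000. */
    static int resolveSampleSize(Long perIndexValue) {
        long value = perIndexValue != null
                ? perIndexValue
                : Long.getLong(SAMPLE_SIZE_PARAM, 1000L);
        return (int) value; // plain narrowing cast, as in "long, cast to int"
    }
}
```

With no JVM params set, {{resolveMode(null)}} yields {{secure}} and {{resolveSampleSize(null)}} yields 1000; starting the JVM with {{-Doak.facets.secure=statistical}} would change the system-wide mode.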

\[0]: https://onlinecourses.science.psu.edu/stat100/node/16/
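
To make the {{statistical}} mode concrete, here is a rough sketch of seeded sampling, ratio computation and extrapolation. It uses selection sampling (each document is picked with probability remainingPicks / remainingDocs), which needs the result size up front; whether that matches the linked paper and the attached patch exactly is an assumption. All names ({{StatisticalFacets}}, {{accessibleRatio}}, {{extrapolate}}, {{expectedError}}) are hypothetical, and the ACL check is stood in for by a predicate on the document position:

```java
import java.util.Random;
import java.util.function.IntPredicate;

class StatisticalFacets {
    /** Expected error rate for a sample, i.e. sampleSize ^ -0.5. */
    static double expectedError(int sampleSize) {
        return Math.pow(sampleSize, -0.5);
    }

    /**
     * Picks min(sampleSize, resultSize) document positions by selection sampling
     * and returns the fraction of picked documents that pass the ACL check.
     * The same seed and result size always produce the same picks.
     */
    static double accessibleRatio(int resultSize, int sampleSize, long seed,
                                  IntPredicate isAccessible) {
        int toPick = Math.min(sampleSize, resultSize);
        if (toPick <= 0) {
            return 0.0; // nothing to sample
        }
        Random random = new Random(seed);
        int picked = 0;
        int accessible = 0;
        for (int doc = 0; doc < resultSize && picked < toPick; doc++) {
            int remainingDocs = resultSize - doc;
            int remainingPicks = toPick - picked;
            // Pick this document with probability remainingPicks / remainingDocs.
            if (random.nextInt(remainingDocs) < remainingPicks) {
                picked++;
                if (isAccessible.test(doc)) {
                    accessible++;
                }
            }
        }
        return (double) accessible / picked;
    }

    /** Scales a facet count reported by the index by the accessible ratio. */
    static long extrapolate(long indexFacetCount, double ratio) {
        return Math.round(indexFacetCount * ratio);
    }
}
```

For the default sample size, {{expectedError(1000)}} evaluates to about 0.032, which is where the "roughly 3%" figure comes from. Because the generator is seeded from the stored {{seed}}, repeating the same query over unchanged data samples the same documents and so returns the same counts.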


was (Author: catholicon):
Checking ACLs is fairly expensive, and an accurate count would require an ACL check
over the whole result set. So, we'd expand the current boolean form of {{secure}} to an enum:
{{insecure}}, {{statistical}} and {{secure}}.

The {{insecure}} mode won't do any ACL check and would return facets exactly as returned by the
index. The {{secure}} mode would generalize the current behavior of checking only the first 50
documents to checking the whole result set instead.

The {{statistical}} mode would randomly sample some documents from the result set. It'd compute the
ratio of accessible samples and scale the facet counts by that ratio. A few implementation
details below:
* the default mode would be {{statistical}}, with a default {{sampleSize}} of 1000.
* both the defaults can be overridden system-wide using the JVM params {{oak.facets.secure}} and
{{oak.facet.statistical.sampleSize}}
* one can also set {{secure}} (String) and {{sampleSize}} (long, cast to int) under {{<definition>/facets}}
to override these per index definition
* the sampling is done using the idea presented in https://dl.acm.org/citation.cfm?id=368159
* 1000 was picked as the default sample size because the expected error rate in sampled data is
given by {{sampleSize ^ -0.5}} \[0]. For 1000, this comes out to roughly 3%.
* for the random number seed, we'd insert a random long number {{seed}} under the index definition
during an indexing cycle. This keeps results consistent across refreshes when there is no
other change in indexed data. From a randomness pov this should still be ok, as the actual generated
random numbers depend on result size, which in turn depends on the search query and indexed
data. From a security pov, the seed should be ok as index defs are administrative data.

\[0]: https://onlinecourses.science.psu.edu/stat100/node/16/

> Incorrect Facet Count With Large Dataset and ACLs
> -------------------------------------------------
>
>                 Key: OAK-7929
>                 URL: https://issues.apache.org/jira/browse/OAK-7929
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>            Priority: Major
>             Fix For: 1.10
>
>         Attachments: 0001-OAK-7930-Add-tape-sampling.patch, 0002-OAK-7929-Incorrect-Facet-Count-With-Large-Dataset-an.patch
>
>
> Currently, ACL (secure) facet handling only deals with the first batch of results from the lucene
> index (50 documents). So, for large result sets, the facet counts don't get decremented
> for a large part of the result set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
