lucene-dev mailing list archives

From "Houston Putman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-11711) Improve memory usage of pivot facets
Date Fri, 01 Dec 2017 16:33:00 GMT

     [ https://issues.apache.org/jira/browse/SOLR-11711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Houston Putman updated SOLR-11711:
----------------------------------
    Description: 
Currently, when sending pivot facet requests to each shard, the {{facet.pivot.mincount}} is
set to {{0}} if the facet is sorted by count with a specified limit > 0. However, with a
mincount of 0, the memory wasted by the pivot facet grows exponentially with every pivot field
added. This is because a total of {{limit^(# of pivots)}} pivot values will be created
in memory, even though the vast majority of them will have counts of 0 and are therefore
useless.

Imagine a pivot facet with 3 levels and {{facet.limit=1000}}. A billion (1000^3) pivot
values will be created, and almost certainly nowhere near a billion of them will have
counts > 0.
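
As a rough illustration of that growth, here is a minimal Python sketch of the arithmetic.
It is not Solr code: the function name {{pivot_values_created}} is made up for this example,
and it assumes the same limit applies at every pivot level.

```python
def pivot_values_created(limit: int, levels: int) -> int:
    """Number of pivot values created when every level returns `limit` entries.

    Illustrative only: models the limit^(# of pivots) growth described above.
    """
    if limit < 0 or levels < 1:
        raise ValueError("limit must be >= 0 and levels must be >= 1")
    return limit ** levels


# Each extra pivot level multiplies the total by the limit.
for levels in (1, 2, 3):
    print(f"{levels} level(s): {pivot_values_created(1000, levels):,} pivot values")
```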

This is likely due to the reasoning mentioned in [this comment in the original distributed pivot
facet ticket|https://issues.apache.org/jira/browse/SOLR-2894?focusedCommentId=13979898]. Basically,
it was thought that the refinement code would need to know that a count was 0 for a shard
so that a refinement request wasn't sent to that shard. However, this is already checked in the code,
[in this part of the refinement candidate checking|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.1.0/solr/core/src/java/org/apache/solr/handler/component/PivotFacetField.java#L275].
Therefore, if the {{facet.pivot.mincount}} were set to 1, the non-existent values would either:
* Not be known, because the {{facet.limit}} was smaller than the number of facet values with
positive counts. This isn't an issue, because they wouldn't have been returned with
{{facet.pivot.mincount}} set to 0 either.
* Be known, because the {{facet.limit}} would be larger than the number of facet values
returned. Therefore this conditional would return false (since we are only talking about pivot
facets sorted by count).

The solution is to use the same pivot mincount as would be used if no limit were specified.
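
The two cases above can be checked against a small model of a single shard. This is only a
sketch under simplifying assumptions (one facet level, sorted by count); the names
{{shard_response}} and {{unseen_values_are_zero}} are hypothetical and do not correspond to
the logic in {{PivotFacetField}}.

```python
def shard_response(counts, limit, mincount):
    """Return up to `limit` (value, count) pairs sorted by count descending,
    keeping only values whose count is at least `mincount`."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(value, count) for value, count in ranked if count >= mincount][:limit]


def unseen_values_are_zero(response, limit):
    """With mincount=1, a response shorter than `limit` means the shard ran out
    of values with positive counts, so every value it did not return has count 0."""
    return len(response) < limit


shard = {"a": 5, "b": 2, "c": 0, "d": 0}

# Case 1: the limit is smaller than the number of positive counts.
# Zero-count values are not returned with either mincount.
assert shard_response(shard, limit=1, mincount=0) == shard_response(shard, limit=1, mincount=1)

# Case 2: the limit is larger than the number of positive counts.
padded = shard_response(shard, limit=3, mincount=0)   # padded with ("c", 0)
trimmed = shard_response(shard, limit=3, mincount=1)  # positive counts only
assert [pair for pair in padded if pair[1] > 0] == trimmed
assert unseen_values_are_zero(trimmed, limit=3)
```

In both cases the mincount of 1 loses no information the coordinator needs, while the
zero-count padding is never created.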


This also relates to a similar problem in field faceting that was "fixed" in [SOLR-8988|https://issues.apache.org/jira/browse/SOLR-8988#13324].
The solution there was to add a flag, {{facet.distrib.mco}}, which avoids choosing a mincount
of 0 when it is unnecessary. Since this flag can only improve performance and doesn't break any
queries, I have removed it as an option and changed the code to always use the feature.
There was one code change necessary to fix the MCO option, since the refinement candidate
selection logic had a bug. The bug only occurred with a minCount > 0 and limit > 0 specified.
When a shard replied with fewer values than the limit requested, the logic would assume the
next maximum count on that shard was the {{mincount}}, whereas it would actually be the
{{mincount-1}} (because a facet value with a count of mincount would have been returned).
Therefore the MCO didn't cause any errors, but with a mincount of 1 the refinement logic always
assumed that the shard had more values with a count of 1.
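
The corrected bound can be sketched as follows. This is an illustrative Python model, not
the Solr implementation: {{max_unseen_count}} is a made-up name, it assumes limit > 0 and a
shard that sorts by count, and clamping the result at 0 is an assumption of the sketch.

```python
def max_unseen_count(returned_counts, limit, mincount):
    """Upper bound on the count of any value the shard did NOT return.

    Assumes limit > 0 and that the shard sorted its values by count.
    """
    if len(returned_counts) < limit:
        # The shard returned every value with count >= mincount, so an unseen
        # value can have at most mincount - 1 (clamped at 0 in this sketch).
        return max(mincount - 1, 0)
    # The shard filled the limit, so an unseen value can be as high as the
    # lowest count it returned.
    return min(returned_counts)


print(max_unseen_count([5, 2], limit=3, mincount=1))     # short response
print(max_unseen_count([5, 2, 1], limit=3, mincount=1))  # full response
```

Under the old assumption the short response would have been given a bound of {{mincount}}
instead, which with a mincount of 1 makes every short shard look as if it might still hold
values with a count of 1.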

  was:
Currently, when sending pivot facet requests to each shard, the {{facet.pivot.mincount}} is
set to {{0}} if the facet is sorted by count with a specified limit > 0. However, with a
mincount of 0, the memory wasted by the pivot facet grows exponentially with every pivot field
added. This is because a total of {{limit^(# of pivots)}} pivot values will be created
in memory, even though the vast majority of them will have counts of 0 and are therefore
useless.

Imagine a pivot facet with 3 levels and {{facet.limit=1000}}. A billion (1000^3) pivot
values will be created, and almost certainly nowhere near a billion of them will have
counts > 0.

This is likely due to the reasoning mentioned in [this comment in the original distributed pivot
facet ticket|https://issues.apache.org/jira/browse/SOLR-2894?focusedCommentId=13979898]. Basically,
it was thought that the refinement code would need to know that a count was 0 for a shard
so that a refinement request wasn't sent to that shard. However, this is already checked in the code,
[in this part of the refinement candidate checking|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.1.0/solr/core/src/java/org/apache/solr/handler/component/PivotFacetField.java#L275].
Therefore, if the {{facet.pivot.mincount}} were set to 1, the non-existent values would either:
* Not be known, because the {{facet.limit}} was smaller than the number of facet values with
positive counts. This isn't an issue, because they wouldn't have been returned with
{{facet.pivot.mincount}} set to 0 either.
* Be known, because the {{facet.limit}} would be larger than the number of facet values
returned. Therefore this conditional would return false (since we are only talking about pivot
facets sorted by count).

The solution is to use the same pivot mincount as would be used if no limit were specified.


This also relates to a similar problem in field faceting that was "fixed" in [SOLR-8988|https://issues.apache.org/jira/browse/SOLR-8988#13324].
The solution there was to add a flag, {{facet.distrib.mco}}, which avoids choosing a mincount
of 0 when it is unnecessary. Since this flag can only improve performance and doesn't break any
queries, I have removed it as an option and changed the code to always use the feature.


> Improve memory usage of pivot facets
> ------------------------------------
>
>                 Key: SOLR-11711
>                 URL: https://issues.apache.org/jira/browse/SOLR-11711
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: faceting
>    Affects Versions: master (8.0)
>            Reporter: Houston Putman
>              Labels: pull-request-available
>             Fix For: 5.6, 6.7, 7.2
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

