lucene-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
Date Thu, 19 Jul 2018 17:31:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549602#comment-16549602 ]

ASF subversion and git services commented on SOLR-12343:
--------------------------------------------------------

Commit a7fe950074a834edc070c265df1394181b268683 in lucene-solr's branch refs/heads/branch_7x
from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a7fe950 ]

SOLR-12343: Fixed a bug in JSON Faceting that could cause incorrect counts/stats when using
non default sort options

This also adds a new configurable "overrefine" option

(cherry picked from commit 3a5d4a25df310d2021fa947ea593cc9b3c93a386)
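
As a rough illustration of where the new option would sit in a request, here is a minimal sketch that builds a JSON Facet request body. The field name cat_s, the facet label "categories" and all numeric values are hypothetical; only the "overrefine" option name comes from the commit message above, and its exact semantics should be checked against the Solr documentation for your version.

```python
import json

# Sketch of a JSON Facet request body. The field name (cat_s), the facet
# label ("categories") and all numeric values are made up for illustration.
facet_request = {
    "query": "*:*",
    "facet": {
        "categories": {
            "type": "terms",
            "field": "cat_s",
            "limit": 10,
            "sort": "count asc",   # a non-default sort, as in the bug report
            "refine": True,        # enable refinement
            "overrequest": 20,     # arbitrary example value
            "overrefine": 20,      # the new option; arbitrary example value
        }
    },
}

# Serialize to the JSON text that would be sent as the request body.
body = json.dumps(facet_request, indent=2)
print(body)
```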


> JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-12343
>                 URL: https://issues.apache.org/jira/browse/SOLR-12343
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Yonik Seeley
>            Priority: Major
>         Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch,
SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch,
__incomplete_processEmpty_microfix.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement can cause
_refined_ buckets to be "bumped out" of the topN based on the refined counts/stats depending
on the sort - causing _unrefined_ buckets originally discounted in phase#2 to bubble up into
the topN and be returned to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a {{sort: 'count
asc'}} facet:
>  * Assume shard1 returns termX & termY in phase#1 because they have very low shard1
counts
>  ** but *not* returned at all by shard2, because these terms both have very high shard2
counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cut" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete count/stat/sub-facet
data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a significantly
higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the client have counts/stats
that are the cumulation of all shards, but termY only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. Additional
overrequest just increases the number of "extra" terms needed in the index with "better" sort
values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during refinement can cause
a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) asc|desc}}
, etc...
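
The 'count asc' scenario in the quoted description can be reproduced with a small toy model. This is only a sketch of the behavior as described, not Solr's code: the shard contents, the extra terms termA and termB, and the helper names are all made up, and the "naive re-sort" step is an assumption about what the description calls re-sorting after refinement.

```python
# Toy model of the 'count asc' scenario from the quoted issue.
# All shard data is hypothetical; this is not Solr's implementation.
LIMIT = 1       # limit=N requested by the client
SHARD_TOP = 2   # buckets each shard returns in phase#1 (limit + overrequest)

shard1 = {"termX": 1, "termY": 2, "termA": 40, "termB": 50}
shard2 = {"termX": 100, "termY": 90, "termA": 3, "termB": 4}
shards = [shard1, shard2]


def lowest(counts, n):
    """Return the n (term, count) pairs with the lowest counts ('count asc')."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]


def simulate():
    # Phase#1: each shard reports only its own lowest-count terms.
    responses = [dict(lowest(s, SHARD_TOP)) for s in shards]
    known = {}
    for resp in responses:
        for term, count in resp.items():
            known[term] = known.get(term, 0) + count

    # Phase#2: the coordinator refines only the topN candidates, asking
    # each shard that did not already report the term.
    candidates = [term for term, _ in lowest(known, LIMIT)]
    for term in candidates:
        for shard, resp in zip(shards, responses):
            if term not in resp:
                known[term] += shard.get(term, 0)

    # Naive re-sort over *all* known buckets, refined or not.
    returned = lowest(known, LIMIT)
    return candidates, returned


candidates, returned = simulate()
true_totals = {t: shard1[t] + shard2[t] for t in shard1}
print("refined:", candidates)
print("returned:", returned)
print("true total for returned term:", true_totals[returned[0][0]])
```

In this toy data termX is the only bucket refined, but after its count grows it sorts behind termY, so termY is returned with only its shard1 contribution.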



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
