lucene-solr-user mailing list archives

From Yonik Seeley <ysee...@gmail.com>
Subject Re: Nested facet complete wrong counts
Date Sat, 11 Nov 2017 00:47:28 GMT
I do notice you are using hll (hyper-log-log) which is a distributed
cardinality *estimate* : https://en.wikipedia.org/wiki/HyperLogLog

-Yonik
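
To illustrate why an hll() value should be read as an estimate, here is a minimal HyperLogLog sketch in Python. It is not Solr's implementation; the function name `hll_estimate`, the SHA-1 based hash, and the register count (p=12) are assumptions chosen only for illustration.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values with a basic HyperLogLog.

    Illustrative only: hash choice and precision p are assumptions,
    not what Solr uses internally.
    """
    m = 1 << p                      # number of registers
    registers = [0] * m
    for v in values:
        # Take a 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                    # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    true_count = 50000
    est = hll_estimate(range(true_count))
    print(true_count, round(est), f"{100 * (est - true_count) / true_count:.2f}%")
```

The estimate is normally close to the true count but rarely equal to it, so small differences against exact bucket counts are expected.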


On Fri, Nov 10, 2017 at 11:32 AM, kenny <kenny@ontoforce.com> wrote:
> Hi all,
>
> We are doing some tests in Solr 6.6 with the JSON Facet API and we get
> completely wrong counts for some combinations of facets.
>
> Setting: We have a set of fields for 376k documents in our query (total 120M
> documents). We work with 2 shards. We first facet over the first field and
> keep those counts, then we run a nested facet over both fields.
>
> Then we add up the sub-facet counts and expect to get (approximately) the
> same numbers back. Sometimes we get rounding errors of about 1%
> difference. But on other occasions it seems to be way off.
>
> for example
>
> Gender (3 values) Country (211 values)
> 16226 - 18424 = -2198 (-13.5461604832%)
> 282854 - 464387 = -181533 (-64.1790464338%)
> 40489 - 47902 = -7413 (-18.3086764306%)
> 36672 - 49749 = -13077 (-35.6593586387%)
>
> Gender (3 values)  Status (17 Values)
> 16226 - 16273 = -47 (-0.289658572661%)
> 282854 - 435974 = -153120 (-54.1339348215%)
> 40489 - 49925 = -9436 (-23.305095211%)
> 36672 - 54019 = -17347 (-47.3031195462%)
>
> ...
>
> These are the typical requests we submit. Note that we use refine and
> overrequest, but in the case of Gender vs Request we should query all the
> buckets anyway.
>
> {"wt":"json","rows":0,"json.facet":"{\"Status_sfhll\":\"hll(Status_sf)\",\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}","q":"*:*","fq":["type:\"something\""]}
>
> {"wt":"json","rows":0,"json.facet":"{\"Gender_sf\":{\"type\":\"terms\",\"field\":\"Gender_sf\",\"missing\":true,\"refine\":true,\"overrequest\":10,\"limit\":10,\"offset\":0,\"facet\":{\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}},\"Gender_sfhll\":\"hll(Gender_sf)\"}","q":"*:*","fq":["type:\"something\""]}
>
> Is this a known bug? Would switching to the old facet API resolve this? Are
> there other parameters we are missing?
>
>
> Thanks
>
>
> kenny
>
>
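
The consistency check described in the quoted message (parent bucket count minus the sum of its sub-facet counts, as a percentage of the parent count) can be sketched as follows. The helper names and the sample response are hypothetical; the response shape (a "buckets" list with "val"/"count" and a "missing" entry carrying a "count") is assumed to match what the JSON Facet API returns for terms facets.

```python
def discrepancy(parent_count, child_sum):
    """Return (difference, percent of parent) for one bucket."""
    diff = parent_count - child_sum
    pct = 100.0 * diff / parent_count if parent_count else 0.0
    return diff, pct


def check_nested(parent_facet, sub_name):
    """Compare each parent bucket count with the sum of its sub-facet counts.

    parent_facet: dict with a "buckets" list (assumed response shape).
    sub_name: key of the nested facet inside each bucket.
    """
    results = {}
    for bucket in parent_facet.get("buckets", []):
        sub = bucket.get(sub_name, {})
        child_sum = sum(b["count"] for b in sub.get("buckets", []))
        # Documents without a value are reported separately when missing=true.
        child_sum += sub.get("missing", {}).get("count", 0)
        results[bucket["val"]] = discrepancy(bucket["count"], child_sum)
    return results


if __name__ == "__main__":
    # First figure from the message: 16226 - 18424
    print(discrepancy(16226, 18424))

    # Hypothetical response fragment, for illustration only.
    sample = {
        "buckets": [
            {"val": "a", "count": 10,
             "Status_sf": {"buckets": [{"val": "x", "count": 6},
                                       {"val": "y", "count": 3}],
                           "missing": {"count": 1}}},
            {"val": "b", "count": 5,
             "Status_sf": {"buckets": [{"val": "x", "count": 4},
                                       {"val": "y", "count": 3}]}},
        ]
    }
    print(check_nested(sample, "Status_sf"))
```

For a single-valued sub-facet field the sums should match exactly on a single shard. On a multi-valued field a document is counted in several sub-buckets, so the sum can legitimately exceed the parent count.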
