lucene-issues mailing list archives

From "Munendra S N (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula
Date Wed, 04 Dec 2019 13:12:00 GMT

     [ https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Munendra S N updated SOLR-11725:
--------------------------------
        Parent: SOLR-14006
    Issue Type: Sub-task  (was: Improvement)

> json.facet's stddev() function should be changed to use the "Corrected sample stddev"
formula
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11725
>                 URL: https://issues.apache.org/jira/browse/SOLR-11725
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for {{facet.pivot+stats.field}}
vs {{json.facet}} I noticed that the {{stddev}} values computed by the two code paths
can be measurably different, and realized this is because they use very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)))}}
> Since I'm not really a math guy, I consulted a bunch of smart math/stat nerds I
know online to help me sanity check whether these equations (somehow) reduced to each other (in
which case the discrepancies I was seeing in my results might have just been due to the order
of intermediate operation execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, and explained
that the code JSON Faceting is using is equivalent to the "Uncorrected sample stddev" formula,
while StatsComponent's code is equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and pressed
them to explain which one was the "most canonical" (or "most generally applicable" or "best")
definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking, the two calculations don't tend to differ significantly
when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a Solr result
set (or against a sub-set of results defined by a facet constraint) is probably to compare
that distribution to a different Solr result set (or to compare N sub-sets of results defined
by N facet constraints)
> * the size of the sets of documents (values) can be relatively small when computing stats
over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected sample stddev"
equation.
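
For readers who want to see the discrepancy numerically, here is a minimal Java sketch that evaluates the two expressions quoted in the description side by side. The class name {{StddevComparison}}, the method names, and the sample values are hypothetical and chosen only for illustration; none of this is Solr code. Algebraically, the corrected result equals the uncorrected one multiplied by sqrt(count / (count - 1)), so the gap should shrink as count grows.

```java
// Hypothetical helper for illustration only; not part of Solr.
final class StddevComparison {

    // Expression quoted from StddevAgg.java in the issue ("uncorrected" form).
    static double uncorrected(double sum, double sumSq, long count) {
        return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    }

    // Expression quoted from StatsValuesFactory.java in the issue ("corrected" form).
    static double corrected(double sum, double sumSq, long count) {
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
    }

    public static void main(String[] args) {
        int[] sizes = {2, 10, 1_000, 100_000};
        for (int n : sizes) {
            double sum = 0;
            double sumSq = 0;
            for (int i = 0; i < n; i++) {
                double v = i % 10; // arbitrary sample values, assumed for the demo
                sum += v;
                sumSq += v * v;
            }
            double u = uncorrected(sum, sumSq, n);
            double c = corrected(sum, sumSq, n);
            // The ratio c / u should approach 1 as n grows.
            System.out.printf("n=%d uncorrected=%.6f corrected=%.6f ratio=%.6f%n",
                    n, u, c, c / u);
        }
    }
}
```

Note that with count = 1 the corrected expression, exactly as quoted, evaluates 0/0 in double arithmetic and yields NaN, so a real implementation would presumably need to handle small counts explicitly.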



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

