db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rick Hillegas (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DERBY-5466) Add support for SQL Standard statistics functions, such as STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP
Date Tue, 04 Aug 2015 03:12:04 GMT

     [ https://issues.apache.org/jira/browse/DERBY-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Rick Hillegas updated DERBY-5466:
    Attachment: var_pop_formulas.txt

Thanks for the patch, Scott. Your factoring of the hard part into the population variance
class (VarPAggregator, var_pop()) is very elegant.

It took me a while to refresh my memory of what these functions do. Other people may appreciate
the description of these functions found here: http://www.macroption.com/population-sample-variance-standard-deviation/

Googling around, I found several formulas for calculating population variance. The easiest
one to understand is the Sybase approach: http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc38151.1520/html/iqrefbb/CHDGJBBI.htm
This is the Sybase formula:

  1/n * sum( (xi - m)(xi - m) )


    m is the mean
    n is the number of items in the population
    and xi (for i = 1...n) are the items in the population

IBM gives another formula (http://www-01.ibm.com/support/knowledgecenter/ssw_ibm_i_71/db2/rbafzcolvar.htm):

  sum( xi * xi )/n - m*m

It took me a while to convince myself that the formulas are equivalent. For future reference,
I have captured my reasoning in the attached var_pop_formulas.txt file.

I can't provide any arguments for whether one of these formulas is more accurate than the
other in terms of not losing precision as the values accumulate. However, the IBM formula
does not expose Derby to out-of-memory errors while building an ArrayList which may come to
hold hundreds of millions of values.

I notice that you are using the Sybase formula. I recommend reworking VarPAggregator to use
the IBM formula, instead.


> Add support for SQL Standard statistics functions, such as STDDEV_POP, STDDEV_SAMP, VAR_POP,
> -----------------------------------------------------------------------------------------------------
>                 Key: DERBY-5466
>                 URL: https://issues.apache.org/jira/browse/DERBY-5466
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions:
>            Reporter: Lukas Eder
>            Priority: Minor
>              Labels: derby_triage10_10
>         Attachments: DERBY-5466.stats.patch, var_pop_formulas.txt
> Any of these RDBMS support the SQL standard statistics functions STDDEV_POP, STDDEV_SAMP,
> - DB2 (only STDDEV, VARIANE)
> - H2 
> - Ingres 
> - MySQL 
> - Oracle 
> - Postgres 
> - SQL Server (named STDEVP, STDEV, VARP, VAR)
> - Sybase ASE
> - Sybase SQL Anywhere
> These don't:
> - Derby
> - SQLite
> This would be a useful addition for Derby, I think.
> An even larger example list of possible statistics aggregate functions is listed in the
Postgres documentation:
> http://www.postgresql.org/docs/9.0/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE

This message was sent by Atlassian JIRA

View raw message