commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ted Dunning (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MATH-418) add a storeless version of Percentile
Date Sun, 23 Mar 2014 16:38:45 GMT

    [ https://issues.apache.org/jira/browse/MATH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944470#comment-13944470
] 

Ted Dunning commented on MATH-418:
----------------------------------

The t-digest is also available.  

You can read more at

https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true

This paper contains accuracy comparisons against the DataFu StreamingQuantile implementation
of GK01 and against the Q-digest algorithm.  T-digest dominates both in terms of space and
accuracy.  I haven't tested for speed yet since the t-digest implementation is improving rapidly.
 T-digest is also a very simple algorithm conceptually which helps with accurate implementation.
 T-digest is particularly good at maintaining accuracy for extreme quantiles and for highly
skewed distribution.  GK01 does well on skewed distributions, but maintains constant absolute
accuracy rather than relative accuracy.  Q-digest depends on integer representations and thus
fails badly on sufficiently skewed distributions.  

The library involved is available via maven and is apache licensed.  Apache Commons Math has
a "no dependency" policy which might mean that sucking in the code would be a better option
than simply linking to this.

The standard of the art before t-digest is generally considered to be either Greenwald and
Khanna's algorithm GK01 (what DataFu uses) or the Q-digest (what stream-lib used before they
adopted t-digest).  References are in the paper above.

In case it isn't obvious, source code is available on github at

https://github.com/tdunning/t-digest

The p^2 algorithm is actually quite old and is far from the state of the art.

> add a storeless version of Percentile
> -------------------------------------
>
>                 Key: MATH-418
>                 URL: https://issues.apache.org/jira/browse/MATH-418
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 2.1
>            Reporter: Luc Maisonobe
>             Fix For: 4.0
>
>
> The Percentile class can handle only in-memory data.
> It would be interesting to use an on-line algorithm to estimate quantiles as a storeless
statistic.
> An example of such an algorithm is the exponentially weighted stochastic approximation
 described in a 2000 paper by Fei Chen ,  Diane Lambert  and José C. Pinheiro "Incremental
Quantile Estimation for Massive Tracking" which can be retrieved from CiteSeerX at [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message