commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ted Dunning (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MATH-418) add a storeless version of Percentile
Date Sun, 20 Apr 2014 02:05:16 GMT

    [ https://issues.apache.org/jira/browse/MATH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975026#comment-13975026
] 

Ted Dunning commented on MATH-418:
----------------------------------

Murthy,

I think that you didn't get the point.

All of these algorithms (except those retaining all of the dat) are approximate.  That isn't
news, nor does it differentiate the algorithms.

The questions are:

a) is only a single quantile computed and is that quantile flexible?

b) how accurate is the quantile calculation?

c) how does the accuracy vary as a function of quantile, especially near 0 or 1?

d) how long does it take to insert a new point?

My assertion is that t-digest dominates all of these other options on all of these questions
except possibly for (d) where it roughly matches them.  I am not sure I see the point of implementing
these other algorithms if they provide no advantage.

To be specific:

GK - is relatively fast and allows computation of multiple quantiles.  Memory usage for a
large number of quantiles is large, accuracy is less than t-digest controlling for memory
usage.  Accuracy at quantiles near 0 or 1 is several orders of magnitude worse.  GK is not
particularly susceptible to skewed distributions.

Q-digest - only works on an integer domain which makes it not work well at all for skewed
distributions.  It has comparable accuracy as GK when quantization is not a factor.  Memory
usage is quite comparable to t-digest, but accuracy is noticeably worse and near q=0 or 1,
accuracy is several orders of magnitude worse.

BinMedian and BinApprox - only computes the median, fails entirely for skewed distributions
where the median is outside the range mean - sd to mean + sd.  Accuracy is comparable to other
algorithms but since no quantiles other than median are computed it doesn't even compete.


> add a storeless version of Percentile
> -------------------------------------
>
>                 Key: MATH-418
>                 URL: https://issues.apache.org/jira/browse/MATH-418
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 2.1
>            Reporter: Luc Maisonobe
>             Fix For: 4.0
>
>         Attachments: patch
>
>
> The Percentile class can handle only in-memory data.
> It would be interesting to use an on-line algorithm to estimate quantiles as a storeless
statistic.
> An example of such an algorithm is the exponentially weighted stochastic approximation
 described in a 2000 paper by Fei Chen ,  Diane Lambert  and José C. Pinheiro "Incremental
Quantile Estimation for Massive Tracking" which can be retrieved from CiteSeerX at [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message