Return-Path: X-Original-To: apmail-commons-issues-archive@minotaur.apache.org Delivered-To: apmail-commons-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2966F10120 for ; Sun, 23 Mar 2014 16:39:10 +0000 (UTC) Received: (qmail 75752 invoked by uid 500); 23 Mar 2014 16:39:05 -0000 Delivered-To: apmail-commons-issues-archive@commons.apache.org Received: (qmail 75343 invoked by uid 500); 23 Mar 2014 16:38:57 -0000 Mailing-List: contact issues-help@commons.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: issues@commons.apache.org Delivered-To: mailing list issues@commons.apache.org Received: (qmail 75036 invoked by uid 99); 23 Mar 2014 16:38:45 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 23 Mar 2014 16:38:45 +0000 Date: Sun, 23 Mar 2014 16:38:45 +0000 (UTC) From: "Ted Dunning (JIRA)" To: issues@commons.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (MATH-418) add a storeless version of Percentile MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/MATH-418?page=3Dcom.atlassian.j= ira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D139444= 70#comment-13944470 ]=20 Ted Dunning commented on MATH-418: ---------------------------------- The t-digest is also available. =20 You can read more at https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.= pdf?raw=3Dtrue This paper contains accuracy comparisons against the DataFu StreamingQuanti= le implementation of GK01 and against the Q-digest algorithm. T-digest dom= inates both in terms of space and accuracy. I haven't tested for speed yet= since the t-digest implementation is improving rapidly. T-digest is also = a very simple algorithm conceptually which helps with accurate implementati= on. T-digest is particularly good at maintaining accuracy for extreme quan= tiles and for highly skewed distribution. GK01 does well on skewed distrib= utions, but maintains constant absolute accuracy rather than relative accur= acy. Q-digest depends on integer representations and thus fails badly on s= ufficiently skewed distributions. =20 The library involved is available via maven and is apache licensed. Apache= Commons Math has a "no dependency" policy which might mean that sucking in= the code would be a better option than simply linking to this. The standard of the art before t-digest is generally considered to be eithe= r Greenwald and Khanna's algorithm GK01 (what DataFu uses) or the Q-digest = (what stream-lib used before they adopted t-digest). References are in the= paper above. In case it isn't obvious, source code is available on github at https://github.com/tdunning/t-digest The p^2 algorithm is actually quite old and is far from the state of the ar= t. > add a storeless version of Percentile > ------------------------------------- > > Key: MATH-418 > URL: https://issues.apache.org/jira/browse/MATH-418 > Project: Commons Math > Issue Type: New Feature > Affects Versions: 2.1 > Reporter: Luc Maisonobe > Fix For: 4.0 > > > The Percentile class can handle only in-memory data. > It would be interesting to use an on-line algorithm to estimate quantiles= as a storeless statistic. > An example of such an algorithm is the exponentially weighted stochastic = approximation described in a 2000 paper by Fei Chen , Diane Lambert and = Jos=C3=A9 C. Pinheiro "Incremental Quantile Estimation for Massive Tracking= " which can be retrieved from CiteSeerX at [http://citeseerx.ist.psu.edu/vi= ewdoc/summary?doi=3D10.1.1.105.1580]. -- This message was sent by Atlassian JIRA (v6.2#6252)