commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] Proposal for New way of Computing an approximate Percentile without storing input data
Date Sat, 22 Mar 2014 21:40:23 GMT
On 3/22/14, 2:11 PM, venkatesha m wrote:
> Hi,
>
> I would like to propose for adding new way of computing the percentile without needing
to store most of input data.  Since this is my first time on contributing to apache; please
help me / correct me if i miss any procedure here.
>
> Here are the details.
>
> Description: 
> The Percentile calculation in a  traditional way require all the data points to be stored
and sorted before accesiing the pth Percentile value of the data set. However the storage
of points can become prohibitive when we need to make use of the existing Percentile Implementation
at big data scale(For eg: when computing the daily or weekly percentile value of a certain
performance metric where the data points accumulated over day and week may run to GB and TB).
 While platforms such as hadoop exist to solve the data scale issue; the need for a statistical
computation of quantiles without storing data is an absolute essential.
> While looking in commons-math classes though Percentile class is available it is implemented
with storage of input as requirement. So was wondering if we could add a class to calculate
Percentile without needing to store data.  The algorithm that i have chosen to implement and
propose is based on P Square algorithm ( http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
) which requires a minimal and finite set of memory stores to compute percentiles for continuous
stream of data.
>
> Ref: 
> http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf which has succing representation of
the workflow of the algorithm
>
> Advantages: 
> a) As is claimed in the orignal workd the accuracy improves over moderate to large data
sets which is the need.
> b) A minimal and constant sized data store used to compute a large data set 
> c) Useful in Hadoop Map reduce applications
>
> Implementation:
> I have implemented this algorithm based on StorelessUnivariateStatistic after checking
out from 3.2 branch.  I have also opened a JIRA ticket on the same (https://issues.apache.org/jira/browse/MATH-1112
) for requesting a new feature to be added. 
>
> Please let me know when and how i could send my code for review.

Thanks, Murthy and welcome!

I am personally fine with the P-Square algorithm and would welcome a
patch including implementation.  Unless others disagree with this
approach (give it a day or two), I would go ahead and attach a patch
with the implementation to the JIRA you opened.

Thanks in advance for your contributions!

Phil
>
> thanks
> murthy


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message