On Mon, Mar 14, 2011 at 12:34 PM, Luc Maisonobe <Luc.Maisonobe@free.fr>wrote:
> Le 14/03/2011 15:33, Benson Margulies a écrit :
> > Please excuse the following ignorant question.
> >
> > I want to maintain summary statistics of a rate. At each 'event', I
> > know the number of characters and the time it took to process them,
> > and I want to maintain summary statistics for the rate of
> > chars/second. I imagine that I'm missing something basic, but I don't
> > see how to do this.
>
> You should define some windows width, either in terms of a time span
> (all events in the last n seconds) or in terms of number of events (last
> n events).
>
> In [math], we do not provide (yet) anything for maintaining such a data
> structure, you'll have to maintain the events in this slot by yourself,
> with something similar to a FIFO.
>
The problem for Benson's case is that the FIFO is a global structure.
Computing the global moving
average is a bit trickier.
> Another thing we discussed some months ago (but did not implement yet)
> is a way to compute an approximation of percentiles in a flow of data
> without storing them. There is an interesting algorithm for it that was
> developed for the needs of telecommunication companies, I think it may
> be of interest to you. This would provide results like : currently 95%
> of the characters are processed in n milliseconds. would you be
> interested in us implementing this feature ?
>
I would recommend stealing it instead from the OnlineSummarizer in Mahout
which computes rank
statistics as well as average and standard deviation in a completely online
way. The only rank statistics
computed with now are the quartiles, but this can easily be changed to
compute 1, 5, 95 and 99%iles. This
is done in a completely online way (O(1) memory and time per sample) and is
roughly as accurate for
computing quantiles of a distribution as sorting all the samples and
observing the empirical quantiles.
The current version simplifies the published algorithm slightly. As a
result, some extremely skewed distributions
can have quartiles that are in error. For instance, if I remember
accurately, the unit gamma distribution with
shape = 0.001 has a minimum of 0 and a median of 10^50 while the Mahout
code only gives an estimate of the
median of 10^20. Relative to the distance between the min to the median,
this is an unfortunately large error.
Relative to the distance from the mean to the min or relative to the
interquartile distance, this error is trivial.
I know of no practical use cases for such an extreme distributions except as
very conservative scale preserving
priors in some Bayesian computations. Even in these cases, estimation of
the quantiles is not useful since only
the posterior is of interest and it invariably is less extreme if any data
is considered.
