couchdb-dev mailing list archives

From: Paul Davis <paul.joseph.da...@gmail.com>
Subject: Re: Statistics Module
Date: Fri, 30 Jan 2009 05:57:50 GMT
On Fri, Jan 30, 2009 at 12:32 AM, Antony Blakey <antony.blakey@gmail.com> wrote:
>
> On 30/01/2009, at 9:56 AM, Paul Davis wrote:
>
>> The way that stats are calculated currently, with the dependent
>> variable being time, could cause some issues in implementing more
>> statistics. With my extremely limited knowledge of stats I think
>> moving that to be dependent on the number of requests might be better.
>> This is something that hopefully someone out there knows more about.
>> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
>> requests", the latter of the two making stddev type stats
>> calculable on the fly in constant memory.)
>
> The problem with using # of requests is that depending on your data, each
> request may take a long time. I have this problem at the moment: 1008
> documents in a 3.5G media database. During a compact, the status in
> _active_tasks updates every 1000 documents, so you can imagine how useful
> that is :/ I thought it had hung (and neither the beam.smp CPU time nor the
> IO requests were a good indicator). I spent some time chasing this down as
> a bug before realising the problem was in the status granularity!
>

Actually, I don't think that affects my question at all. It may change
how we report things, though. It may be important to be able to report
things that are not single increment/decrement conditions, but instead
allow recording arbitrary floating point numbers as data points.
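
Roughly what I'm picturing, as a quick Python sketch (names made up, not
the actual stats module API) of the two kinds of data points:

class Counter:
    """Single increment/decrement conditions, e.g. number of open requests."""
    def __init__(self):
        self.value = 0

    def incr(self, delta=1):
        self.value += delta

class Recorder:
    """Arbitrary floating point data points, e.g. documents compacted per
    status update, or seconds taken per request."""
    def __init__(self):
        self.points = []

    def record(self, value):
        self.points.append(float(value))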

IMO, your specific use case only strengthens my argument for having
requests (or more specifically, data points) be the dependent variable.

To explain the case more clearly: if we don't treat the collected data
points as the dependent variable, we are unable to calculate extended
statistics like variance/stddev. This is because if the dependent
variable is time, then the number of data points is unbounded, and that
gives us unbounded memory usage. (I know of no incremental algorithms
for calculating these statistics without knowledge of past values,
though I could be wrong.)

In other words, if we're doing stats for the last N values, then when we
store value N+1 we must know the oldest value so it can be removed from
the calculations. If the dependent variable is time, then N can be
arbitrarily large, thus causing memory problems.
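
To make the memory point concrete, here's a rough Python sketch (purely
illustrative, names made up, not the actual module): with a fixed window
of the last n data points, mean and stddev can be kept up to date with
two running sums, but the oldest value still has to be retrievable so it
can be subtracted back out.

from collections import deque
from math import sqrt

class WindowedStats:
    """Mean/stddev over the last n data points in bounded memory."""

    def __init__(self, n):
        self.n = n
        self.window = deque()   # the last <= n samples (the "past values")
        self.total = 0.0        # running sum
        self.total_sq = 0.0     # running sum of squares

    def record(self, x):
        x = float(x)
        self.window.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.window) > self.n:
            oldest = self.window.popleft()   # must know this to drop it
            self.total -= oldest
            self.total_sq -= oldest * oldest

    def mean(self):
        return self.total / len(self.window) if self.window else 0.0

    def stddev(self):
        k = len(self.window)
        if k < 2:
            return 0.0
        variance = (self.total_sq - self.total * self.total / k) / (k - 1)
        return sqrt(max(variance, 0.0))

With the dependent variable being request count, n is fixed and the deque
stays bounded; if the window is "the last 5 minutes" instead, the deque
can grow without bound, which is exactly the memory problem above.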

Your use case just changes each data point from being an inc/dec op to
a "store arbitrary number" op. In any case, I'm not at all comfortable
relying solely on my own knowledge of calculating statistics
incrementally, so hopefully there's a stats buff out there who will feel
compelled to weigh in.

HTH,
Paul Davis

P.S. For those of you wondering why standard deviation is important, I
reference the ever-so-eloquent Zed Shaw [1]: "Programmers Need To Learn
Statistics Or I Will Kill Them All." Also, he is right.

[1] http://www.zedshaw.com/rants/programmer_stats.html


> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> The ultimate measure of a man is not where he stands in moments of comfort
> and convenience, but where he stands at times of challenge and controversy.
>  -- Martin Luther King
>
>
>
