On Sun, Jun 8, 2008 at 10:18 AM, Phil Steitz <phil@steitz.com> wrote:
> It's probably best to take the discussion to the dev list.
~
Hi,
~
This thread started on user@commons.apache.org under the subject
"commons.apache.org/math/stat/"
~
Formal need for a way to keep incremental statistics as part of the package:
~
If you do heavy data analysis or mining, you benefit greatly from
being able to keep the statistics as part of the data itself, without
having to create a stat.descriptive.DescriptiveStatistics object
for each pattern. (I am not talking about hypothetical scenarios
here: I run into this problem when parsing statistics on
linguistic patterns in large bodies of text, such as the
text bank hosted by the gutenberg.org project.)
~
When you have many patterns whose frequency distributions you are
interested in, you don't really want to internally "maintain datasets
of values for each of them and compute descriptive statistics based on
stored data". You would rather keep a data structure that looks like
this:
~
class sdt {
  public String pattern;  // the pattern being tracked
  public long lastOffset; // offset of the most recent occurrence
  public int tmsFnd;      // times found
  public double mean;     // mean
  public double stdDev;   // standard deviation of the data distribution
  public double skew;     // skewness
}
~
You would update each such object as needed from the latest statistics
kept within it and the newly found value (which could
be, e.g., the difference between the current offset and the previous one).
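~
To make the idea concrete, here is a minimal sketch in Java of keeping one small record per pattern and updating its running mean gap each time the pattern is found again. The names PatternGapTracker, Entry, record and meanGap are hypothetical and not part of Commons Math. The sketch assumes the tracked value is the gap between consecutive offsets, and it only maintains the mean.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Commons Math class: one small record per pattern.
final class PatternGapTracker {

    static final class Entry {
        long lastOffset;   // offset of the most recent occurrence
        int timesFound;    // number of occurrences seen so far
        double meanGap;    // running mean of the gaps between occurrences
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Records one occurrence of the pattern at the given offset.
    void record(String pattern, long offset) {
        Entry e = entries.computeIfAbsent(pattern, k -> new Entry());
        if (e.timesFound > 0) {
            long gap = offset - e.lastOffset;
            int gapsSoFar = e.timesFound - 1;
            // Incremental mean: only the old mean and the count are needed.
            e.meanGap += (gap - e.meanGap) / (gapsSoFar + 1);
        }
        e.timesFound++;
        e.lastOffset = offset;
    }

    // Returns the record for the pattern, or null if it was never seen.
    Entry get(String pattern) {
        return entries.get(pattern);
    }
}
```

No per-pattern dataset is stored; memory use is one Entry per distinct pattern.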
~
By the way, after studying the structure of the API a bit more, I
agree with you that it should go in
"org.apache.commons.math.stat.StatUtils"
~
Here is the deceptively simple math it entails; I will need to use
some ASCII "art" here:
~
The mean of the variable X is defined as:
~
Mean(X, N) := Mean(xi, i in [1, N]) := (x1 + x2 + x3 + . . . + xn)/N
~
Now, when the new (N + 1)-th value arrives, the mean becomes
~
Mean(X, (N + 1)) := Mean(xi, i in [1, (N + 1)])
                 := (x1 + x2 + x3 + . . . + xn + x(n+1))/(N + 1)
~
Playing with it algebraically a bit, we get:
~
(N + 1) * Mean(X, (N + 1)) = (x1 + x2 + x3 + . . . + xn + x(n+1))
~
(N + 1) * Mean(X, (N + 1)) = N * Mean(X, N) + x(n+1)
~
So we can naturally (it is a "simple" additive induction)
express the new mean as a function of the old mean and the new value:
~
I) Mean(X, (N + 1)) = (N/(N + 1)) * Mean(X, N) + (x(n+1)/(N + 1))
~
So for I) you will need:
~
1) "N"
2) new value x(n+1)
3) the old Mean
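~
As a minimal sketch, update I) translates almost literally into Java. The class and method names are hypothetical and are not an existing StatUtils method.

```java
// Hypothetical helper illustrating formula I); not part of Commons Math.
final class IncrementalMean {

    private IncrementalMean() {}

    // Returns Mean(X, N + 1) from N, the old mean Mean(X, N) and the new value x(n+1).
    static double update(long n, double oldMean, double newValue) {
        double n1 = n + 1.0;
        return (n / n1) * oldMean + newValue / n1;
    }
}
```

Starting from n = 0 with any old mean, the first call simply returns the first value, since the old mean is multiplied by 0.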
~
The standard deviation works out so similarly that, having shown the
derivation for the mean, there is no need to spell it out. You will
need:
~
1) "N"
2) new value x(n+1)
3) the old Mean
4) the old std Dev
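~
For illustration, here is a sketch of one way to do that update from exactly those four inputs. It assumes the population standard deviation (dividing by N, not N - 1), for which

Var(X, (N + 1)) = (N/(N + 1)) * (Var(X, N) + (x(n+1) - Mean(X, N))^2/(N + 1))

The class and method names are hypothetical.

```java
// Hypothetical helper; assumes the population standard deviation (divide by N).
final class IncrementalStdDev {

    private IncrementalStdDev() {}

    // Returns the standard deviation of N + 1 values from N, the old mean,
    // the old standard deviation and the new value.
    static double update(long n, double oldMean, double oldStdDev, double newValue) {
        double n1 = n + 1.0;
        double delta = newValue - oldMean;            // deviation from the OLD mean
        double oldVariance = oldStdDev * oldStdDev;
        double newVariance = (n / n1) * (oldVariance + delta * delta / n1);
        return Math.sqrt(newVariance);
    }
}
```

Note that the old mean must be passed in, so the standard deviation has to be updated before the stored mean is overwritten with its new value.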
~
and for the Skewness you will need:
~
1) "N"
2) new value x(n+1)
3) the old Mean
4) the old std Dev
5) the old Skewness
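~
A sketch of the skewness update from those five inputs follows. It assumes the population skewness, i.e. the third central moment divided by the cube of the population standard deviation, which may differ from the bias-corrected estimator a library uses. The class and member names are hypothetical.

```java
// Hypothetical accumulator; assumes population (not bias-corrected) moments.
final class RunningSkewness {
    long n;          // number of values seen so far
    double mean;     // running mean
    double stdDev;   // running population standard deviation
    double skew;     // running population skewness (0 when stdDev is 0)

    // Folds one new value into the stored statistics.
    void add(double x) {
        // Recover the sums of squared and cubed deviations from the old statistics.
        double m2 = n * stdDev * stdDev;
        double m3 = n * skew * stdDev * stdDev * stdDev;

        long n1 = n + 1;
        double delta = x - mean;   // deviation from the OLD mean

        // Update the sums; the cubed sum uses the old squared sum.
        double newM3 = m3
                + delta * delta * delta * (n1 - 1) * (n1 - 2) / ((double) n1 * n1)
                - 3.0 * delta * m2 / n1;
        double newM2 = m2 + delta * delta * (n1 - 1) / n1;

        n = n1;
        mean += delta / n1;
        double variance = newM2 / n1;
        stdDev = Math.sqrt(variance);
        skew = variance > 0.0 ? (newM3 / n1) / Math.pow(variance, 1.5) : 0.0;
    }
}
```

Converting back and forth between the stored skewness and the cubed sum on every update costs some precision; storing the sums themselves avoids that at the price of two extra fields.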
~
I know this all looks nice ;), but computers are not good at
dealing with the rounding errors that will
certainly appear as a byproduct of all this "simple"-looking math. I am
sure you must be using some kind of magic to offset those.
> . . . Patches are always welcome.
~
Let me know how I can help you (/commons.apache.org/math/stat/). I
could write some actual Java, but I am not familiar with the guts of the
API or the underlying framework, so it is up to you whether to invite me
as an upstream developer and mentor me initially, or whether you would just
be happy with my suggestion. By the way, I am a mathematician (actually
a theoretical physicist) myself.
~
See you
lbrtchx
~

