commons-dev mailing list archives

From Phil Steitz <p...@steitz.com>
Subject [math] Re: commons.apache.org/math/stat/
Date Mon, 09 Jun 2008 02:30:03 GMT
Albretch Mueller wrote:
> On Sun, Jun 8, 2008 at 10:18 AM, Phil Steitz <phil@steitz.com> wrote:
>   
>> It's probably best to take the discussion to the dev list.
>>     
> ~
>  Hi,
> ~
>  this thread started in user@commons.apache.org as
> "commons.apache.org/math/stat/"
> ~
>  Formal need for a way to keep incremental statistics as part of the package:
> ~
>  If you do heavy data analysis/mining you will greatly benefit from
> being able to keep the data statistics as part of the data itself and
> not having to create stat.descriptive.DescriptiveStatistics objects
> for each pattern (and I am not talking about hypothetical scenarios
> here; I encounter such problems when parsing stats related to
> linguistic patterns in large bodies of text, such as those found in
> large text banks like the one that the gutenberg.org project hosts).
> ~
>  When you have lots of patterns whose frequency distributions you are
> interested in, you don't really want to internally "maintain datasets
> of values for each of them and compute descriptive statistics based on
> stored data"; you could easily keep a data structure that looks like
> this:
> ~
>  class sdt {
>   public String pattern;   // the pattern being tracked
>   public long lastOffset;  // offset of the most recent occurrence
>   public int tmsFnd;       // times found
>   public double mean;      // mean
>   public double stdDev;    // standard deviation of the data distribution
>   public double skew;      // skewness
>  }
> ~
>  whose instances you would update as needed, based on the latest stats
> kept within the object and the newly found value (which could be,
> e.g., the difference between the current offset and the previous one)
>   
If you are willing to drag along StorelessUnivariateStatistic instances 
in the structures above, you could accomplish this by calling their 
increment() methods.  The statistics (such as the ones above) that do 
not require their complete supporting datasets to be stored are 
implemented in commons math as "StorelessUnivariateStatistic" 
implementations. These objects expose increment() methods that allow 
them to be updated with new data values.  See, e.g., 
org.apache.commons.math.stat.descriptive.moment.Mean.
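
As a rough illustration of "dragging along" an incrementally updated 
statistic in each per-pattern record, here is a minimal self-contained 
sketch.  RunningMean and PatternStats are hypothetical stand-ins, not 
commons math classes; only the increment() naming follows the 
description above, and the plain sum-and-count bookkeeping is chosen 
for brevity rather than taken from the library.

```java
// Hypothetical stand-in for a storeless statistic: it keeps only a count
// and a running sum, never the data values themselves.
class RunningMean {
    private long n = 0;
    private double sum = 0.0;

    // Fold one new value into the statistic.
    void increment(double x) {
        n++;
        sum += x;
    }

    // Current mean, or NaN if no values have been added yet.
    double getResult() {
        return n == 0 ? Double.NaN : sum / n;
    }

    long getN() {
        return n;
    }
}

// Hypothetical per-pattern record, loosely following the quoted "sdt" class.
class PatternStats {
    final String pattern;
    private long lastOffset = -1;   // -1 means "not seen yet"
    private final RunningMean gapMean = new RunningMean();

    PatternStats(String pattern) {
        this.pattern = pattern;
    }

    // Record an occurrence at the given offset; the tracked value is the
    // gap between this offset and the previous one.
    void found(long offset) {
        if (lastOffset >= 0) {
            gapMean.increment(offset - lastOffset);
        }
        lastOffset = offset;
    }

    double meanGap() {
        return gapMean.getResult();
    }
}
```

Each record then costs a few fields per pattern regardless of how many 
occurrences are seen.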
> ~
>  By the way, after studying the structure of the API a bit more, I
> agree with you that it should go in
> "org.apache.commons.math.stat.StatUtils"
>   
Yes.  I can see the usefulness of this for cases where the current API 
is too heavy.  What probably makes sense is individual update methods 
for common statistics that admit this.
> ~
>  Here is the deceptively simple Math it entails; I will need to use
> some ASCII "art" here:
> ~
>  the mean for the variable X is defined as:
> ~
>  Mean(X, N) := Mean(xi, i[1, N]) := (x1 + x2 + x3 + . . . + xn)/N
> ~
>  Now, when the new (N + 1)th value arrives, the Mean becomes
> ~
>  Mean(X, (N + 1)) := Mean(xi, i[1, (N + 1)]) := (x1 + x2 + x3 + . . .
> + xn + x(n+1))/(N + 1)
> ~
>  Playing with it algebraically a bit, we get:
> ~
>  (N + 1)  * Mean(X, (N + 1)) = (x1 + x2 + x3 + . . . + xn + x(n+1))
> ~
>  (N + 1)  * Mean(X, (N + 1)) = N * Mean(X, N) + x(n+1)
> ~
>  So we can naturally (it is a "simple" additive induction)
> express the new Mean as a function of the old Mean and the new value:
> ~
>  I) Mean(X, (N + 1)) = (N/(N + 1)) * Mean(X, N) + (x(n+1)/(N + 1))
> ~
>  So for I) you will need
> ~
>  1) "N"
>  2) new value x(n+1)
>  3) the old Mean
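
Update I) can be written as a small static helper taking exactly the 
three inputs listed.  This is only a sketch: the names MeanUpdate and 
updateMean are made up for illustration and are not part of StatUtils.

```java
class MeanUpdate {
    // Formula I): new mean from the old mean, the old count n, and the
    // newly arrived value x(n+1).
    static double updateMean(double oldMean, long n, double newValue) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be >= 0");
        }
        double n1 = n + 1.0;
        return (n / n1) * oldMean + newValue / n1;
    }
}
```

Starting from n = 0 with any finite old mean, feeding in 2, 4 and 9 one 
at a time walks the mean through 2, 3 and 5 (up to rounding).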
> ~
>  For the std Dev the derivation is so similar that, having shown it
> for the mean, it doesn't really need to be spelled out; you will then
> need:
> ~
>  1) "N"
>  2) new value x(n+1)
>  3) the old Mean
>  4) the old std Dev
> ~
>  and for the Skewness you will need:
> ~
>  1) "N"
>  2) new value x(n+1)
>  3) the old Mean
>  4) the old std Dev
>  5) the old Skewness
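
The lists above name the inputs but not the update rules.  One common 
way to get them, sketched below, is to carry the running sums of 
squared and cubed deviations from the mean (called m2 and m3 here) 
instead of the std dev and skewness themselves, and to derive both on 
demand.  The class name, the field names and the choice of population 
(divide-by-N) definitions are assumptions of this sketch, not a 
description of the commons math implementation.

```java
class RunningMoments {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;   // sum of (x - mean)^2 over the values so far
    private double m3 = 0.0;   // sum of (x - mean)^3 over the values so far

    // Fold one new value into the count, mean, m2 and m3.
    void increment(double x) {
        long n1 = n;            // count before this value
        n++;
        double delta = x - mean;
        double deltaN = delta / n;
        double term1 = delta * deltaN * n1;
        mean += deltaN;
        // m3 is updated before m2 because it needs the old m2.
        m3 += term1 * deltaN * (n - 2) - 3.0 * deltaN * m2;
        m2 += term1;
    }

    long getN() {
        return n;
    }

    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    // Population standard deviation: sqrt(m2 / n).
    double getStdDev() {
        return n == 0 ? Double.NaN : Math.sqrt(m2 / n);
    }

    // Population skewness: (m3 / n) / (m2 / n)^(3/2); NaN without spread.
    double getSkewness() {
        if (n == 0 || m2 == 0.0) {
            return Double.NaN;
        }
        return Math.sqrt((double) n) * m3 / Math.pow(m2, 1.5);
    }
}
```

For the values 0, 0, 3 this gives mean 1, m2 = 6 and m3 = 6, hence a 
std dev of sqrt(2) and a skewness of 1/sqrt(2).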
> ~
>  I know that just looks nice ;-), but computers are not good at
> figuring out what to do with all the rounding errors that will
> certainly appear as a byproduct of all this "simple"-looking Math. I
> am sure you must be using some kind of magic in order to offset those
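
On the rounding concern: one commonly used rearrangement of I) is 
Mean(X, N + 1) = Mean(X, N) + (x(n+1) - Mean(X, N))/(N + 1), which adds 
a small correction to the old mean instead of rescaling it.  The sketch 
below uses hypothetical names (StableMean, update, meanOf) and is meant 
as an illustration of that rearrangement, not as a statement of what 
commons math does internally.

```java
class StableMean {
    // Rearranged form of I): old mean plus a correction term.
    static double update(double oldMean, long n, double newValue) {
        return oldMean + (newValue - oldMean) / (n + 1);
    }

    // Fold a whole array through the update, one value at a time.
    // Returns 0.0 for an empty array, since no update is ever applied.
    static double meanOf(double[] values) {
        double mean = 0.0;
        for (int i = 0; i < values.length; i++) {
            mean = update(mean, i, values[i]);
        }
        return mean;
    }
}
```

Because the running value always stays on the scale of the data, large 
offsets such as 1e9 + 1, 1e9 + 2, 1e9 + 3 do not force a large 
intermediate sum.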
> ~
>   
>> . . . Patches are always welcome.
>>     
> ~
>  Let me know how I can help you (/commons.apache.org/math/stat/). I
> could code some actual Java, but I am not familiar with the guts of
> the API/underlying framework, so you would decide whether you want to
> invite me as an upstream developer and mentor me initially, or whether
> you would just be happy with my suggestion. By the way, I am a
> Mathematician (actually a theoretical Physicist myself)
>   
We are always happy to welcome new contributors to commons or any other 
Apache project.  The best way to start working on commons math is to 
check out the developers page:  
http://commons.apache.org/math/developers.html.  Follow the directions 
there to get set up with Subversion, Maven or Ant, and JUnit to build 
the code and run the tests.  Then follow the instructions on submitting 
patches through JIRA and you are off to the races :)

Please do not hesitate to ask here or mail me personally if you need 
help getting set up to build and test the code.

Phil
> ~
>  See you
>  lbrtchx
> ~
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>   

