commons-dev mailing list archives

From "Phil Steitz" <p...@steitz.com>
Subject Re: [math] log representaion of sums was:Re: [math] Priorities, help needed
Date Sat, 24 May 2003 17:34:51 GMT
Mark R. Diggory wrote:
> Phil Steitz wrote:
> 
>> Brent Worden wrote:
>>
>>> Agreed.  I would like to add that I think we're a little overly
>>> concerned about the actual implementation of the algorithm.  In these
>>> early stages of the project, I think it's wiser to spend time
>>> discussing the evolving design and API.  In the end, that is how
>>> people will judge the value of this project.  People will care far
>>> less about how rock-solid the geometric mean algorithm is than about
>>> how many features it provides and how easy it is to use.
>>
>>
>>
>> I could not agree more.  I have been using (and sharing) the original, 
>> no-storage, no-rolling version of Univariate for a couple of years now 
>> and have found it to be simple, lightweight and easy to use.  That is 
>> why I contributed it.  The only thing that I think we really need to 
>> worry about as we get the initial release together is that we 
>> carefully document the interfaces and the contracts -- otherwise the 
>> stuff will not be usable -- and maintain implementation quality.  We 
>> should try to avoid stupid things and really bad numerical algorithms, 
>> but I agree that our focus should be on getting basic, easy to use, 
>> frequently demanded functionality into the package.  Regarding 
>> Univariate in particular, my feeling is that the most important things 
>> to get in there are percentiles and confidence intervals.  These are 
>> what people actually use (beyond the arithmetic mean and variance).
>>
>> Have you looked at the task list here:
>> http://jakarta.apache.org/commons/sandbox/math/tasks.html?
>>
>> Do you have a) comments on these / alternative suggestions  b) code to 
>> contribute or c) time to spend helping with implementation?
> 
> 
> I'm concerned it's starting to get difficult to see such clear interfaces 
> with all the code piled up in one package. Refactoring is relatively 
> easy at this stage. I want to suggest we begin to isolate different 
> functionalities in separate packages for clarity's sake.

We certainly need to deal with this "soon". I posted a similar 
decomposition a while back.  Robert suggested that we wait until we had 
assembled more material.  I think it might be best to wait just a bit 
longer, since I think we all agree that we need to discuss scope, and 
where we end up there will affect the natural package structure.

> 
> One possibility is:
> 
> *org.apache.commons.math.random*
> 
> EmpiricalDistribution
> EmpiricalDistributionImpl
> RandomData
> RandomDataImpl
> ValueServer
> 
> *org.apache.commons.math.la*
> 
> RealMatrix
> RealMatrixImpl
> 
> *org.apache.commons.math.util*
> 
> ContractableDoubleArray
> ExpandableDoubleArray
> FixedDoubleArray
> DoubleArray
> 
> *org.apache.commons.math.stat*
> 
> TestStatistic
> TestStatisticImpl
> Freq
> Univariate
> UnivariateImpl
> ListUnivariateImpl
> AbstractStoreUnivariate
> StoreUnivariate
> 
> 
> The idea being similar in nature to the SAX or DOM APIs. Maybe we can 
> establish a set of interfaces/factories for these implementations. Maybe 
> there are questions about having the "Impl" vs having a factory approach 
> to object instantiation. I'm not sure that there would be enough 
> "Implementations" to support an API/spec with Factory based instantiation.

In each of the cases where interfaces have been abstracted, I think that 
there likely will be multiple implementations and in fact one of the 
advantages of commons-math should be extensibility.  I don't much like 
the "Impl" names. They should probably "soon" be changed to be 
meaningful.  For example (see more below) "RandomDataImpl" should 
probably be called something like "JDKRandomData" and "UnivariateImpl" 
should be called something like "StreamUnivariate" or 
"RollingUnivariate".  "RealMatrixImpl" should be something like 
"DoubleRealMatrix" (allowing "BigDecimalRealMatrix").  Here again, I 
would hold off just a bit longer before jumping into this.
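
To make the naming point concrete, here is a minimal sketch, not the
actual sandbox code.  The interface is cut down to three methods chosen
only for illustration, and "ListUnivariate" is a hypothetical name for a
storing implementation.  The point is that each class name says how the
implementation behaves, where "Impl" says nothing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, cut-down interface; the method set is an assumption. */
interface Univariate {
    void addValue(double v);
    int getN();
    double getMean();
}

/** No-storage implementation: keeps only running totals. */
class StreamUnivariate implements Univariate {
    private int n;
    private double sum;

    public void addValue(double v) {
        n++;
        sum += v;
    }

    public int getN() {
        return n;
    }

    public double getMean() {
        return n == 0 ? Double.NaN : sum / n;
    }
}

/** Storing implementation (hypothetical name): keeps every value. */
class ListUnivariate implements Univariate {
    private final List<Double> values = new ArrayList<>();

    public void addValue(double v) {
        values.add(v);
    }

    public int getN() {
        return values.size();
    }

    public double getMean() {
        if (values.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }
}
```

Client code would be written against Univariate only, so swapping one
implementation for the other is a one-line change.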

> 
> I do have some concerns about the Random library and Random Number 
> Generation/Distributions.
> 
> 1.) the JDK provides for "pluggability" behind their Random number 
> generator. So you can plug different implementations in behind it; 
> ideally this should be taken advantage of in terms of providing 
> different methods of random number generation. This is probably one 
> limitation of the CERN random generation libraries.
> 
I thought about this and it is addressed (sort of) in two ways in the 
current setup.  First, abstracting the RandomData interface enables 
virtually any kind of implementation to be "plugged in". Second, the 
setSecureAlgorithm method of RandomDataImpl allows the underlying 
algorithm and provider for the "secure" methods to be reset.  The basic 
interface and the JDK-based implementation were designed to be simple 
and easy to use, while supporting some simple, generally useful 
extensions of what comes out of the box from java.util.Random: 
reseeding, generation of exponential and Poisson deviates, and 
generation of uniform and normal values within specified ranges.
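
As a rough illustration of the extensions listed above, here is a
self-contained sketch built only on java.util.Random.  The method names
and signatures are assumptions made for this example and should not be
read as the actual RandomData contract; the Poisson generator uses
Knuth's simple multiplication method, which is only reasonable for
small means.

```java
import java.util.Random;

/** Hypothetical, reduced interface; names and signatures are assumptions. */
interface RandomData {
    void reSeed(long seed);
    double nextUniform(double lower, double upper);
    double nextGaussian(double mu, double sigma);
    double nextExponential(double mean);
    long nextPoisson(double mean);
}

/** JDK-backed sketch: everything delegates to one java.util.Random. */
class JDKRandomData implements RandomData {
    private final Random rand = new Random();

    public void reSeed(long seed) {
        rand.setSeed(seed);
    }

    public double nextUniform(double lower, double upper) {
        if (lower >= upper) {
            throw new IllegalArgumentException("lower must be < upper");
        }
        // Scale a uniform [0, 1) draw onto the requested interval.
        return lower + rand.nextDouble() * (upper - lower);
    }

    public double nextGaussian(double mu, double sigma) {
        if (sigma <= 0) {
            throw new IllegalArgumentException("sigma must be positive");
        }
        // Shift and scale a standard normal draw.
        return mu + sigma * rand.nextGaussian();
    }

    public double nextExponential(double mean) {
        if (mean <= 0) {
            throw new IllegalArgumentException("mean must be positive");
        }
        // Inversion method: 1 - U lies in (0, 1], so the log is finite.
        return -mean * Math.log(1.0 - rand.nextDouble());
    }

    public long nextPoisson(double mean) {
        if (mean <= 0) {
            throw new IllegalArgumentException("mean must be positive");
        }
        // Knuth's method: multiply uniforms until the product drops
        // to exp(-mean) or below; the number of extra factors is the deviate.
        double limit = Math.exp(-mean);
        long k = 0;
        double p = rand.nextDouble();
        while (p > limit) {
            k++;
            p *= rand.nextDouble();
        }
        return k;
    }
}
```

A different generator could be "plugged in" by writing another class
against the same interface, which is the first of the two mechanisms
described above.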

The core problem here -- and below -- is what scope are we aiming at. 
In the proposal, I suggested the following scope:

The Math project shall create and maintain a library of lightweight, 
self-contained mathematics and statistics components addressing the most 
common practical problems not immediately available in the Java 
programming language or commons-lang. The guiding principles for 
commons-math will be:

1. Real-world application use cases determine priority
2. Emphasis on small, easily integrated components rather than large
    libraries with complex dependencies
3. All algorithms are fully documented and follow generally accepted
    best practices
4. In situations where multiple standard algorithms exist, use the
    Strategy pattern to support multiple implementations
5. No external dependencies beyond Commons components and the JDK

This means that we need to keep asking ourselves the question "are we 
meeting a simple application need with a lightweight component that is 
easy to use?"  Personally, I would say that the current RandomData 
interface and the JDK-based implementation satisfy this.  Of course, as 
always, I may be wrong.
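
For what principle 4 could look like in practice, here is a small sketch
using the geometric mean from this thread.  All of the names are
hypothetical, and both strategies assume strictly positive inputs.  One
strategy multiplies the values directly; the other sums logarithms,
which avoids overflow on long inputs.

```java
/** Strategy interface (hypothetical name): one operation, many algorithms. */
interface GeometricMean {
    double evaluate(double[] values);
}

/** Multiplies the values directly; the product can overflow or underflow. */
class ProductGeometricMean implements GeometricMean {
    public double evaluate(double[] values) {
        double product = 1.0;
        for (double v : values) {
            product *= v;
        }
        return Math.pow(product, 1.0 / values.length);
    }
}

/** Sums logarithms instead, so intermediate results stay in range. */
class LogSumGeometricMean implements GeometricMean {
    public double evaluate(double[] values) {
        double sumLog = 0.0;
        for (double v : values) {
            sumLog += Math.log(v);
        }
        return Math.exp(sumLog / values.length);
    }
}
```

A caller holds a GeometricMean reference and picks the implementation
that suits its data; for 1000 values of 1e10 the product version
overflows to infinity while the log-sum version does not.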


> 2.) The Distribution library at CERN has a somewhat successful layout, 
> but I have some problems with it in terms of not being very "Bean like". 
> Parameters often lack getters/setters that are easy to access via a 
> beanlike interface.
> 
> http://hoschek.home.cern.ch/hoschek/colt/V1.0.3/doc/cern/jet/random/package-summary.html

> 
> 
> 
> Finally, I feel a little weird about replicating a lot of the 
> functionality of the CERN library given that it is in production still. 
> It's stupid to overlook the efforts Wolfgang Hoschek has placed into 
> building a solid LGPL'ed open source mathematics library. I fear in some 
> ways we will only end up "replicating" his and others' efforts here. I 
> wonder if Hoschek would have any interest in "standardization" of his 
> packages. Apache could work in his favor if he were interested in 
> allowing his code base to be further maintained and developed here. 
> Inviting community participation would open the code up to further 
> development, enhancement and refactoring to improve the library's 
> infrastructure and save the replication of development. Maybe we should 
> consider contacting him at CERN and get his opinion on such an idea.
> 
> -Mark
> 

This hits a core issue that we need to think carefully about.  The same 
type of thing could be said regarding several other general-purpose math 
or stat libraries.  My personal opinion is that commons-math should 
*not* aim to become a "universal math library" with anything like the 
scope of Colt, JADE, VisualNumerics or any of the excellent libraries 
out there.  Our aim should be to provide a nicely designed and 
documented collection of simple utilities that save developers time and 
licensing pain -- similar to the other commons components.  If we end up 
"duplicating" functionality that exists elsewhere, I do not personally 
see this as a terrible outcome.  I see the ability to discuss and 
implement simple Java interfaces as a real advantage that we will get by 
some limited "re-invention".

In an early draft of the proposal, I had a guiding principle that said 
that each submission should be accompanied by (and evaluated according 
to) real-world application use cases.  I think that it would be a good 
idea to at least informally adhere to this.  So, for example, instead of 
just adding a large library of statistical routines, we would need to 
explain how each of the things to be added are widely used and how the 
design supports ease of use and integration.

Two of the things that I have submitted require some justification, 
which I will add here and if we do not agree, I will be OK with dropping 
them.

EmpiricalDistribution, EmpiricalDistributionImpl

This is useful in simulation or stub-based testing and in generating 
data for histograms.  Specifically, when things like service latencies 
or inter-arrival times are known to follow funny distributions and 
simulations need to generate values "like" those observed in production, 
they are sort of SOL unless they have something like this.  I know of no 
other open source component that provides the ability to generate data 
from an empirical distribution.  Admittedly, this stuff is an order of 
magnitude less demanded than RandomData or Univariate; but it does have 
real practical use, which may grow as testing, simulation and QOS become 
more important to developers.
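
To show the idea in its simplest form, here is a sketch that is
deliberately cruder than what was submitted: it bins the observed values
into a histogram, picks a bin in proportion to its count, and then draws
uniformly within that bin.  The class name, constructor and method name
are all hypothetical.

```java
import java.util.Random;

/** Hypothetical histogram-based generator of values "like" those observed. */
class BinnedEmpiricalDistribution {
    private final double min;
    private final double binWidth;
    private final int[] counts;
    private final int total;
    private final Random rand;

    BinnedEmpiricalDistribution(double[] observed, int binCount, long seed) {
        if (observed.length == 0 || binCount < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        double lo = observed[0];
        double hi = observed[0];
        for (double v : observed) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        min = lo;
        binWidth = (hi - lo) / binCount;
        counts = new int[binCount];
        for (double v : observed) {
            // The maximum value would land one past the end, so clamp it.
            int bin = binWidth == 0.0 ? 0 : (int) ((v - lo) / binWidth);
            counts[Math.min(bin, binCount - 1)]++;
        }
        total = observed.length;
        rand = new Random(seed);
    }

    /** Picks a bin with probability count/total, then a point inside it. */
    double getNextValue() {
        int target = rand.nextInt(total);
        int bin = 0;
        int cumulative = counts[0];
        while (cumulative <= target) {
            bin++;
            cumulative += counts[bin];
        }
        return min + (bin + rand.nextDouble()) * binWidth;
    }
}
```

Loaded with, say, observed service latencies, getNextValue() returns
values that follow the same rough shape, however funny that shape is.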

ValueServer

This is a wrapper that combines EmpiricalDistribution, RandomData and 
the ability to "replay" data from a file directly so that simulation or 
stub-based testing applications can generate values in any of the 
supported modes.  Like EmpiricalDistribution, the main use is for 
stub-based testing and simulation.  What I mean by "stub-based testing" 
is load or functional testing with some or all back end service 
providers replaced by stubs that return canned responses.  The 
ValueServer can be used by the stubs to make them simulate 
production-like latency variation.
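
A stripped-down sketch of the mode-switching idea follows.  It is an
assumption-laden stand-in rather than the submitted class: the recorded
values come from an array instead of a file, the "empirical" mode is
reduced to resampling the recorded values, and every name here is
hypothetical.  Recorded values are assumed to be positive, as latencies
would be.

```java
import java.util.Random;

/** Hypothetical value source that a test stub could call for each latency. */
class SimpleValueServer {
    enum Mode { REPLAY, RESAMPLE, EXPONENTIAL }

    private final double[] recorded;
    private final double mean;
    private final Random rand;
    private Mode mode = Mode.REPLAY;
    private int cursor;

    SimpleValueServer(double[] recorded, long seed) {
        if (recorded.length == 0) {
            throw new IllegalArgumentException("need at least one recorded value");
        }
        this.recorded = recorded.clone();
        double sum = 0.0;
        for (double v : recorded) {
            sum += v;
        }
        this.mean = sum / recorded.length;
        this.rand = new Random(seed);
    }

    void setMode(Mode mode) {
        this.mode = mode;
    }

    double getNext() {
        switch (mode) {
            case REPLAY: {
                // Return the recorded values in order, wrapping at the end.
                double v = recorded[cursor];
                cursor = (cursor + 1) % recorded.length;
                return v;
            }
            case RESAMPLE:
                // Draw one of the recorded values at random.
                return recorded[rand.nextInt(recorded.length)];
            default:
                // Exponential deviate with the same mean as the recorded data.
                return -mean * Math.log(1.0 - rand.nextDouble());
        }
    }
}
```

A stub would call getNext() before returning its canned response and
sleep for that long, and the test harness would choose the mode.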



These are good questions.  We need to keep asking them.

Phil

> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
> 



