commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] Multiple Algorithms
Date Thu, 01 Sep 2011 06:11:16 GMT
On 8/31/11 10:05 PM, Greg Sterijevski wrote:
> Hello All,
>
> This question popped into my head this evening, what is the right way to
> handle multiple algorithms which purport to calculate the same thing? There
> are, for example, a couple of ways to calculate the student t cdf. What is
> the common's philosophy on deciding:
>
> 1. Whether to allow multiple algorithms.
> 2. How an algorithm is included.
>     a.) Does a 'bug' or shortcoming need to be shown in the current
> implementation?
>     b.) Say that algorithm a works for a numerical range and b works best on
> another. Are both included? Is a new 'meta' algorithm written which mixes
> both a and b?
> 3. Does simplicity count?
> 4. Does speed matter?
>
> A while back, Chris Nix reimplemented the SVD routine. I am not sure I
> remember the old routine so I cannot say there was anything worth keeping
> there. However was there a conscious decision to scrap it? Why not have it
> live side by side with the new one? (Again, I am not saying the old
> algorithm was better-Chris' contribution definitely was valuable). I think
> we will run into these issues often.
>
> Thoughts? If this has been discussed already, my apologies.

Well, at a high level, we tried to settle this at the very
beginning.  Have a look at items 3 and 4 in the "guiding principles"
on the [math] main page :)  That stuff comes from the original
proposal for [math] and we have tried to stay more or less faithful
to the principles laid out there.

The integration, ode and solvers packages all try to do exactly what
you are suggesting - when multiple algorithms, or even variations on
an algorithm, exist and no single one can really do the job for all
practical use cases, we welcome and incorporate implementations of
multiple different ones.  How we do that varies by package and this
is one place where our "interface pain" has led to some trauma. 
Initially, we pretty much uniformly separated interfaces from
implementation precisely for this reason.  You can still see that in
the distributions package and while it is a little ironic since we
have been talking about collapsing the interfaces into the impls
there, the setup now supports multiple different implementations of
any of the distributions.

I don't want to turn this into an abstract discussion of how best to
support the strategy pattern in [math].  I think the "how to do it"
part depends on the problem / package.  What we have tried to do so
far and I think we should keep trying to do is:

0) If there is a single best algorithm that really does work well in
almost all cases, make that "the obvious thing" and try to make our
implementation of it as robust as possible.  If we provide only one
implementation, try to make it the best one.  I would say put
SimpleRegression, or Variance, for example, in this category.

1) When multiple different standard algorithms exist, make sure our
API supports adding alternatives - including user-supplied
alternatives - in the least-confusing way we can think of.  Welcome
contributions of the alternatives and try to document as best we can
how and when to use the different implementations.  Try to stick to
standard algorithms, with good external references, so we don't have
to turn our javadoc and/or user guide into a numerical analysis
textbook.

2) Whenever possible, try to encapsulate the part that varies for
different implementations, so the API remains simple and the
variable part can be "plugged in."  Good examples of this are the
RandomGenerators or the Solvers used by different classes.

The general question is a good one; but the specific answer depends
on the algorithms and classes involved.  In the two cases that you
mention (t distribution cdf and SVD), we have to balance API
complexity and more code to support against practical value.  SVD
has been a big challenge for us, so I would say we should start at
step 0) on that one; but I could easily be talked into supporting
multiple impls given strong numerical arguments and multiple good,
robust impls.  For the t distribution, I am not so sure.  A separate
thread for that is probably best.  My intuition there is that like
the other distributions, we and our users are best off with a single
impl that does whatever it needs to do in different ranges to
provide robust estimates, possibly allowing configuration settings
to trade performance for accuracy.

Phil

>
> -Greg
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message