commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles Sadowski <gil...@harfang.homelinux.org>
Subject Re: svn commit: r1174509 - in /commons/proper/math/trunk/src: main/java/org/apache/commons/math/exception/util/ main/java/org/apache/commons/math/stat/regression/ site/xdoc/ test/java/org/apache/commons/math/stat/regression/
Date Fri, 23 Sep 2011 10:51:22 GMT
Hello.

> Author: gregs
> Date: Fri Sep 23 03:36:11 2011
> New Revision: 1174509
> 
> URL: http://svn.apache.org/viewvc?rev=1174509&view=rev
> Log:
> JIRA: MATH-607 Adding support for UpdatingMultipleLinearRegression to SimpleRegression
> 
> Modified:
>     commons/proper/math/trunk/src/main/java/org/apache/commons/math/exception/util/LocalizedFormats.java

Part of the rationale for implementing exception objects was to reduce the
number of overlapping or redundant messages in the "LocalizedFormats" enum.

As I had indicated in the numerous discussions related to this subject, the
sheer size of that list of messages induces one to add more. This has just
happened again.

> 
> Modified: commons/proper/math/trunk/src/main/java/org/apache/commons/math/exception/util/LocalizedFormats.java
> URL: http://svn.apache.org/viewvc/commons/proper/math/trunk/src/main/java/org/apache/commons/math/exception/util/LocalizedFormats.java?rev=1174509&r1=1174508&r2=1174509&view=diff
> ==============================================================================
> --- commons/proper/math/trunk/src/main/java/org/apache/commons/math/exception/util/LocalizedFormats.java
(original)
> +++ commons/proper/math/trunk/src/main/java/org/apache/commons/math/exception/util/LocalizedFormats.java
Fri Sep 23 03:36:11 2011
> @@ -42,6 +42,7 @@ public enum LocalizedFormats implements 
>      // CHECKSTYLE: stop JavadocVariable
>  
>      ARGUMENT_OUTSIDE_DOMAIN("Argument {0} outside domain [{1} ; {2}]"),
> +    ARRAY_SIZE_EXCEEDS_MAX_VARIABLES("array size cannot be greater than {0}"),

With the new "ExceptionContext" functionality, one can add specialized messages
to an exception.
This approach could be taken to combine
  INPUT_ARRAY
  LENGTH
  NUMBER_TOO_LARGE
in order to convey all the information about the failure.

>      ARRAY_SIZES_SHOULD_HAVE_DIFFERENCE_1("array sizes should have difference 1 ({0}
!= {1} + 1)"),
>      ARRAY_SUMS_TO_ZERO("array sums to zero"),
>      ASSYMETRIC_EIGEN_NOT_SUPPORTED("eigen decomposition of assymetric matrices not supported
yet"),
> @@ -135,6 +136,7 @@ public enum LocalizedFormats implements 
>      INVALID_INTERVAL_INITIAL_VALUE_PARAMETERS("invalid interval, initial value parameters:
 lower={0}, initial={1}, upper={2}"),
>      INVALID_ITERATIONS_LIMITS("invalid iteration limits: min={0}, max={1}"),
>      INVALID_MAX_ITERATIONS("bad value for maximum iterations number: {0}"),
> +    NOT_ENOUGH_DATA_REGRESSION("the number of observations is not sufficient to conduct
regression"),

There already exists a
  NO_DATA
message. Instead of piling up messages with "REGRESSION", we can add a
  REGRESSION
and combine with the above to convey the same meaning.
Then, this new building block can be reused with others in order to compose
all sorts of messages about REGRESSION without lengthening the list.
  

>      INVALID_REGRESSION_ARRAY("input data array length = {0} does not match the number
of observations = {1} and the number of regressors = {2}"),
>      INVALID_REGRESSION_OBSERVATION("length of regressor array = {0} does not match the
number of variables = {1} in the model"),
>      INVALID_ROUNDING_METHOD("invalid rounding method {0}, valid methods: {1} ({2}),
{3} ({4}), {5} ({6}), {7} ({8}), {9} ({10}), {11} ({12}), {13} ({14}), {15} ({16})"),
> @@ -239,6 +241,7 @@ public enum LocalizedFormats implements 
>      NO_RESULT_AVAILABLE("no result available"),
>      NO_SUCH_MATRIX_ENTRY("no entry at indices ({0}, {1}) in a {2}x{3} matrix"),
>      NULL_NOT_ALLOWED("null is not allowed"), /* keep */
> +    ARRAY_ZERO_LENGTH_OR_NULL_NOTALLOWED("A null or zero length array not allowed"),

This is downright redundant with
  NO_DATA
  NULL_NOT_ALLOWED
Moreover exceptions already exist for those situations: "NoDataException"
and "NullArgumentException". They should be use instead of their base class.

>      COVARIANCE_MATRIX("covariance matrix"), /* keep */
>      DENOMINATOR("denominator"), /* keep */
>      DENOMINATOR_FORMAT("denominator format"), /* keep */
> 
> Modified: commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/RegressionResults.java
> URL: http://svn.apache.org/viewvc/commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/RegressionResults.java?rev=1174509&r1=1174508&r2=1174509&view=diff
> ==============================================================================
> --- commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/RegressionResults.java
(original)
> +++ commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/RegressionResults.java
Fri Sep 23 03:36:11 2011
> @@ -114,7 +114,7 @@ public class RegressionResults implement
>          this.globalFitInfo = new double[5];
>          Arrays.fill(this.globalFitInfo, Double.NaN);
>  
> -        if (rank > 2) {
> +        if (rank > 0) {
>              this.globalFitInfo[SST_IDX] = containsConstant ?
>                      (sumysq - sumy * sumy / ((double) nobs)) : sumysq;
>          }
> 
> Modified: commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/SimpleRegression.java
> URL: http://svn.apache.org/viewvc/commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/SimpleRegression.java?rev=1174509&r1=1174508&r2=1174509&view=diff
> ==============================================================================
> --- commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/SimpleRegression.java
(original)
> +++ commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/SimpleRegression.java
Fri Sep 23 03:36:11 2011
> @@ -22,8 +22,11 @@ import org.apache.commons.math.MathExcep
>  import org.apache.commons.math.exception.OutOfRangeException;
>  import org.apache.commons.math.distribution.TDistribution;
>  import org.apache.commons.math.distribution.TDistributionImpl;
> +import org.apache.commons.math.exception.MathIllegalArgumentException;
> +import org.apache.commons.math.exception.NoDataException;
>  import org.apache.commons.math.exception.util.LocalizedFormats;
>  import org.apache.commons.math.util.FastMath;
> +import org.apache.commons.math.util.MathUtils;
>  
>  /**
>   * Estimates an ordinary least squares regression model
> @@ -55,7 +58,7 @@ import org.apache.commons.math.util.Fast
>   *
>   * @version $Id$
>   */
> -public class SimpleRegression implements Serializable {
> +public class SimpleRegression implements Serializable, UpdatingMultipleLinearRegression
{
>  
>      /** Serializable version identifier */
>      private static final long serialVersionUID = -3004689053607543335L;
> @@ -98,7 +101,7 @@ public class SimpleRegression implements
>      * Secondary constructor which allows the user the ability to include/exclude const
>      * @param includeIntercept boolean flag, true includes an intercept
>      */
> -    public SimpleRegression(boolean includeIntercept){
> +    public SimpleRegression(boolean includeIntercept) {
>          super();
>          hasIntercept = includeIntercept;
>      }
> @@ -116,7 +119,7 @@ public class SimpleRegression implements
>       * @param x independent variable value
>       * @param y dependent variable value
>       */
> -    public void addData(final double x, final double y){
> +    public void addData(final double x,final double y) {
>          if (n == 0) {
>              xbar = x;
>              ybar = y;
> @@ -158,7 +161,7 @@ public class SimpleRegression implements
>       * @param x independent variable value
>       * @param y dependent variable value
>       */
> -    public void removeData(double x, double y) {
> +    public void removeData(final double x,final double y) {
>          if (n > 0) {
>              if (hasIntercept) {
>                  final double fact1 = (double) n - 1.0;
> @@ -200,13 +203,69 @@ public class SimpleRegression implements
>       * data.</p>
>       *
>       * @param data array of observations to be added
> +     * @throws ModelSpecificationException if the length of {@code data[i]} is not
> +     * greater than or equal to 2
>       */
> -    public void addData(double[][] data) {
> +    public void addData(final double[][] data) {
>          for (int i = 0; i < data.length; i++) {
> +            if( data[i].length < 2 ){
> +               throw new ModelSpecificationException(LocalizedFormats.INVALID_REGRESSION_OBSERVATION,
> +                    data[i].length, 2);
> +            }

This can be conveyed using
  DIMENSIONS_MISMATCH_SIMPLE
  DIMENSIONS_MISMATCH
  REGRESSION (to be added)

>              addData(data[i][0], data[i][1]);
>          }
> +        return;
> +    }
> +
> +    /**
> +     * Adds one observation to the regression model.
> +     *
> +     * @param x the independent variables which form the design matrix
> +     * @param y the dependent or response variable
> +     * @throws ModelSpecificationException if the length of {@code x} does not equal
> +     * the number of independent variables in the model
> +     */
> +    public void addObservation(final double[] x,final double y) throws ModelSpecificationException{
> +        if( x == null || x.length == 0 ){
> +            throw new ModelSpecificationException(LocalizedFormats.INVALID_REGRESSION_OBSERVATION,x!=null?x.length:0,
1);
> +        }
> +        addData( x[0], y );
> +        return;
>      }
>  
> +    /**
> +     * Adds a series of observations to the regression model. The lengths of
> +     * x and y must be the same and x must be rectangular.
> +     *
> +     * @param x a series of observations on the independent variables
> +     * @param y a series of observations on the dependent variable
> +     * The length of x and y must be the same
> +     * @throws ModelSpecificationException if {@code x} is not rectangular, does not
match
> +     * the length of {@code y} or does not contain sufficient data to estimate the model
> +     */
> +    public void addObservations(final double[][] x,final double[] y) {
> +        if ((x == null) || (y == null) || (x.length != y.length)) {
> +            throw new ModelSpecificationException(
> +                  LocalizedFormats.DIMENSIONS_MISMATCH_SIMPLE,
> +                  (x == null) ? 0 : x.length,
> +                  (y == null) ? 0 : y.length);
> +        }

It is wrong to clump tests for null and for array size.
The message could even be _wrong_ since you report a null as having size 0!

For null, we should throw "NullArgumentException".[1]
For zero size, "NoDataException".

> +        boolean obsOk=true;
> +        for( int i = 0 ; i < x.length; i++){
> +            if( x[i] == null || x[i].length == 0 ){
> +                obsOk = false;
> +            }
> +        }
> +        if( !obsOk ){
> +            throw new ModelSpecificationException(
> +                  LocalizedFormats.NOT_ENOUGH_DATA_FOR_NUMBER_OF_PREDICTORS,
> +                  0, 1);
> +        }

REGRESSION
NO_DATA
NUMBER_IS_TOO_SMALL

> +        for( int i = 0 ; i < x.length ; i++){
> +            addData( x[i][0], y[i] );
> +        }
> +        return;
> +    }
>  
>      /**
>       * Removes observations represented by the elements in <code>data</code>.
> @@ -265,8 +324,8 @@ public class SimpleRegression implements
>       * @param x input <code>x</code> value
>       * @return predicted <code>y</code> value
>       */
> -    public double predict(double x) {
> -        double b1 = getSlope();
> +    public double predict(final double x) {
> +        final double b1 = getSlope();
>          if (hasIntercept) {
>              return getIntercept(b1) + b1 * x;
>          }
> @@ -298,7 +357,7 @@ public class SimpleRegression implements
>       *
>       * @return true if constant exists, false otherwise
>       */
> -    public boolean hasIntercept(){
> +    public boolean hasIntercept() {
>          return hasIntercept;
>      }
>  
> @@ -572,7 +631,7 @@ public class SimpleRegression implements
>       * @return half-width of 95% confidence interval for the slope estimate
>       * @throws MathException if the confidence interval can not be computed.
>       */
> -    public double getSlopeConfidenceInterval(double alpha)
> +    public double getSlopeConfidenceInterval(final double alpha)
>          throws MathException {
>          if (alpha >= 1 || alpha <= 0) {
>              throw new OutOfRangeException(LocalizedFormats.SIGNIFICANCE_LEVEL,
> @@ -620,7 +679,7 @@ public class SimpleRegression implements
>      * @param slope current slope
>      * @return the intercept of the regression line
>      */
> -    private double getIntercept(double slope){
> +    private double getIntercept(final double slope) {
>        if( hasIntercept){
>          return (sumY - slope * sumX) / n;
>        }
> @@ -633,7 +692,134 @@ public class SimpleRegression implements
>       * @param slope regression slope estimate
>       * @return sum of squared deviations of predicted y values
>       */
> -    private double getRegressionSumSquares(double slope) {
> +    private double getRegressionSumSquares(final double slope) {
>          return slope * slope * sumXX;
>      }
> +
> +    /**
> +     * Performs a regression on data present in buffers and outputs a RegressionResults
object
> +     * @return RegressionResults acts as a container of regression output
> +     * @throws ModelSpecificationException if the model is not correctly specified
> +     */
> +    public RegressionResults regress() throws ModelSpecificationException{
> +        if( hasIntercept ){
> +          if( n < 3 ){
> +              throw new NoDataException( LocalizedFormats.NOT_ENOUGH_DATA_REGRESSION
);
> +          }

This is slightly misleading ("no data" or "not enough data"?).

> +          if( FastMath.abs( sumXX ) > MathUtils.SAFE_MIN ){
> +              final double[] params = new double[]{ getIntercept(), getSlope() };
> +              final double mse = getMeanSquareError();
> +              final double _syy = sumYY + sumY * sumY / ((double) n);
> +              final double[] vcv = new double[]{
> +                mse * (xbar *xbar /sumXX + 1.0 / ((double) n)),
> +                -xbar*mse/sumXX,
> +                mse/sumXX };
> +              return new RegressionResults(
> +                      params, new double[][]{vcv}, true, n, 2,
> +                      sumY, _syy, getSumSquaredErrors(),true,false);
> +          }else{
> +              final double[] params = new double[]{ sumY/((double) n), Double.NaN };
> +              //final double mse = getMeanSquareError();
> +              final double[] vcv = new double[]{
> +                ybar / ((double) n - 1.0),
> +                Double.NaN,
> +                Double.NaN };
> +              return new RegressionResults(
> +                      params, new double[][]{vcv}, true, n, 1,
> +                      sumY, sumYY, getSumSquaredErrors(),true,false);
> +          }
> +        }else{
> +          if( n < 2 ){
> +              throw new NoDataException( LocalizedFormats.NOT_ENOUGH_DATA_REGRESSION
);
> +          }

Same as above.

> +          if( !Double.isNaN(sumXX) ){
> +          final double[] vcv = new double[]{ getMeanSquareError() / sumXX };
> +          final double[] params = new double[]{ sumXY/sumXX };
> +          return new RegressionResults(
> +                      params, new double[][]{vcv}, true, n, 1,
> +                      sumY, sumYY, getSumSquaredErrors(),false,false);
> +          }else{
> +          final double[] vcv = new double[]{Double.NaN };
> +          final double[] params = new double[]{ Double.NaN };
> +          return new RegressionResults(
> +                      params, new double[][]{vcv}, true, n, 1,
> +                      Double.NaN, Double.NaN, Double.NaN,false,false);
> +          }
> +        }
> +    }
> +
> +    /**
> +     * Performs a regression on data present in buffers including only regressors
> +     * indexed in variablesToInclude and outputs a RegressionResults object
> +     * @param variablesToInclude an array of indices of regressors to include
> +     * @return RegressionResults acts as a container of regression output
> +     * @throws ModelSpecificationException if the model is not correctly specified
> +     * @throws MathIllegalArgumentException if the variablesToInclude array is null
or zero length
> +     * @throws OutOfRangeException if a requested variable is not present in model
> +     */
> +    public RegressionResults regress(int[] variablesToInclude) throws ModelSpecificationException{
> +        if( variablesToInclude == null || variablesToInclude.length == 0){
> +          throw new MathIllegalArgumentException(LocalizedFormats.ARRAY_ZERO_LENGTH_OR_NULL_NOTALLOWED);
> +        }

Clumped tests.

> +        if( variablesToInclude.length > 2 || (variablesToInclude.length > 1 &&
!hasIntercept) ){
> +            throw new ModelSpecificationException(
> +                    LocalizedFormats.ARRAY_SIZE_EXCEEDS_MAX_VARIABLES,
> +                    (variablesToInclude.length > 1 && !hasIntercept) ? 1
: 2);
> +        }

REGRESSION
NUMBER_IS_TOO_LARGE

> +        if( hasIntercept ){
> +            if( variablesToInclude.length == 2 ){
> +                if( variablesToInclude[0] == 1 ){
> +                    throw new ModelSpecificationException(LocalizedFormats.NOT_INCREASING_SEQUENCE);

This one should throw "NonMonotonicSequenceException".
I really don't see the value of clumping completely different failure into
the same "ModelSpecificationException". This renders the exception type
useless.
[The point of having typed exceptionz is to provide the flexibility of
dealing with them programmatically. When the actual causes of failure varies
wildly, this won't be possible.]

> +                }else if( variablesToInclude[0] != 0 ){
> +                    throw new OutOfRangeException( variablesToInclude[0], 0,1 );
> +                }
> +                if( variablesToInclude[1] != 1){
> +                     throw new OutOfRangeException( variablesToInclude[0], 0,1 );
> +                }
> +                return regress();
> +            }else{
> +                if( variablesToInclude[0] != 1 && variablesToInclude[0] != 0
){
> +                     throw new OutOfRangeException( variablesToInclude[0],0,1 );
> +                }
> +                final double _mean = sumY * sumY / ((double) n);
> +                final double _syy = sumYY + _mean;
> +                if( variablesToInclude[0] == 0 ){
> +                    //just the mean
> +                    final double[] vcv = new double[]{ sumYY/((double)((n-1)*n)) };
> +                    final double[] params = new double[]{ ybar };
> +                    return new RegressionResults(
> +                      params, new double[][]{vcv}, true, n, 1,
> +                      sumY, _syy+_mean, sumYY,true,false);
> +
> +                }else if( variablesToInclude[0] == 1){
> +                    //final double _syy = sumYY + sumY * sumY / ((double) n);
> +                    final double _sxx = sumXX + sumX * sumX / ((double) n);
> +                    final double _sxy = sumXY + sumX * sumY / ((double) n);
> +                    final double _sse = FastMath.max(0d, _syy - _sxy * _sxy / _sxx);
> +                    final double _mse = _sse/((double)(n-1));
> +                    if( !Double.isNaN(_sxx) ){
> +                        final double[] vcv = new double[]{ _mse / _sxx };
> +                        final double[] params = new double[]{ _sxy/_sxx };
> +                        return new RegressionResults(
> +                                    params, new double[][]{vcv}, true, n, 1,
> +                                    sumY, _syy, _sse,false,false);
> +                    }else{
> +                        final double[] vcv = new double[]{Double.NaN };
> +                        final double[] params = new double[]{ Double.NaN };
> +                        return new RegressionResults(
> +                                    params, new double[][]{vcv}, true, n, 1,
> +                                    Double.NaN, Double.NaN, Double.NaN,false,false);
> +                    }
> +                }
> +            }
> +        }else{
> +            if( variablesToInclude[0] != 0 ){
> +                throw new OutOfRangeException(variablesToInclude[0],0,0);
> +            }
> +            return regress();
> +        }
> +
> +        return null;
> +    }
>  }
> 
> Modified: commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/UpdatingMultipleLinearRegression.java
> URL: http://svn.apache.org/viewvc/commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/UpdatingMultipleLinearRegression.java?rev=1174509&r1=1174508&r2=1174509&view=diff
> ==============================================================================
> --- commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/UpdatingMultipleLinearRegression.java
(original)
> +++ commons/proper/math/trunk/src/main/java/org/apache/commons/math/stat/regression/UpdatingMultipleLinearRegression.java
Fri Sep 23 03:36:11 2011
> @@ -61,7 +61,7 @@ public interface UpdatingMultipleLinearR
>       * @throws ModelSpecificationException if {@code x} is not rectangular, does not
match
>       * the length of {@code y} or does not contain sufficient data to estimate the model
>       */
> -    void addObservations(double[][] x, double[] y);
> +    void addObservations(double[][] x, double[] y) throws ModelSpecificationException;

This is exactly what I say just above: a single exception for totally
unrelated failures.
[And it is yet another example of mixing (i.e. confusing) library and
application concerns: The library must check the preconditions that are
needed to perform its work, and throw an exception that it commensurate with
the failed test. Then, the application layer (the caller) can possibly
translate this into a "business" language. At the level of CM, we lack the
full context in order to be sure that one fixed high-level message wil be
adequate in all circumstances. As I've written numerous times also: When
designing code for CM, we must strive to separate the roles of "CM user" and
"CM developer".]

>      /**
>       * Clears internal buffers and resets the regression model. This means all
> 
> [...]

Also, please be careful of the code formatting style. This commit contains
many violations of the following rules:
  * There should be a space character before a "{" character.
  * There should be a space character on both sides of a keyword.
  * There should be a space on both sides of an operator.
  * There should not be a space after an opening parenthesis.
  * There should not be a space before a closing parenthesis.
  * There should not be a space before a ";" character.
  * Variables (instance or local) should not contain a "_" character.
  * Indents should be (muliples of) 4 space characters wide.


Thanks and best regards,
Gilles

[1] There is an unsettled issue about whether the CM code should check for
    "null" (and throw "NullArgumentException") or let the JVM throw the
    standard NPE.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message