spark-reviews mailing list archives

From jkbradley <...@git.apache.org>
Subject [GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...
Date Tue, 19 Sep 2017 20:24:52 GMT
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19020#discussion_r139765053
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
    @@ -69,19 +69,57 @@ private[regression] trait LinearRegressionParams extends PredictorParams
         "The solver algorithm for optimization. Supported options: " +
           s"${supportedSolvers.mkString(", ")}. (Default auto)",
         ParamValidators.inArray[String](supportedSolvers))
    +
    +  /**
    +   * The loss function to be optimized.
    +   * Supported options: "leastSquares" and "huber".
    +   * Default: "leastSquares"
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final override val loss: Param[String] = new Param[String](this, "loss", "The loss function to" +
    +    s" be optimized. Supported options: ${supportedLosses.mkString(", ")}. (Default leastSquares)",
    +    ParamValidators.inArray[String](supportedLosses))
    +
    +  /**
    +   * The shape parameter to control the amount of robustness. Must be > 1.0.
    +   * At larger values of epsilon, the huber criterion becomes more similar to least squares
    +   * regression; for small values of epsilon, the criterion is more similar to L1 regression.
    +   * Default is 1.35 to get as much robustness as possible while retaining
    +   * 95% statistical efficiency for normally distributed data.
    +   * Only valid when "loss" is "huber".
    +   */
    +  @Since("2.3.0")
    +  final val epsilon = new DoubleParam(this, "epsilon", "The shape parameter to control the " +
    +    "amount of robustness. Must be > 1.0.", ParamValidators.gt(1.0))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getEpsilon: Double = $(epsilon)
    +
    +  override protected def validateAndTransformSchema(
    +      schema: StructType,
    +      fitting: Boolean,
    +      featuresDataType: DataType): StructType = {
    +    if ($(loss) == Huber) {
    +      require($(solver) != Normal, "LinearRegression with huber loss doesn't support " +
    +        "normal solver, please change solver to auto or l-bfgs.")
    +      require($(elasticNetParam) == 0.0, "LinearRegression with huber loss only supports " +
    +        s"L2 regularization, but got elasticNetParam = $getElasticNetParam.")
    +    }
    +    super.validateAndTransformSchema(schema, fitting, featuresDataType)
    +  }
     }
     
     /**
      * Linear regression.
      *
    - * The learning objective is to minimize the squared error, with regularization.
    - * The specific squared error loss function used is:
    - *
    - * <blockquote>
    - *    $$
    - *    L = 1/2n ||A coefficients - y||^2^
    - *    $$
    - * </blockquote>
    + * The learning objective is to minimize the specified loss function, with regularization.
    + * This supports two loss functions:
    + *  - leastSquares (a.k.a. squared loss)
    --- End diff --
    
    Let's keep exact specifications of the losses being used. This is one of my big annoyances with many ML libraries: it's hard to tell exactly what loss is being used, which makes it hard to compare/validate results across different ML libraries.
    
    It'd also be nice to make it clear what we mean by "huber," in particular that we estimate the scale parameter from data.
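    
    For readers following along, the pointwise Huber loss that the PR's `epsilon` parameter controls can be sketched as below. This is an illustrative, self-contained Scala sketch, not Spark's internal API; note that the implementation under review (per the comment above) also estimates a scale parameter jointly from the data, which this sketch omits.
    
    ```scala
    // Sketch of the pointwise Huber loss (illustrative; HuberLossSketch is a
    // hypothetical name, not a Spark class). The loss is quadratic for residuals
    // within +/- epsilon and linear beyond, which is what makes the criterion
    // robust to outliers while remaining smooth at the transition.
    object HuberLossSketch {
      def huber(residual: Double, epsilon: Double): Double = {
        val a = math.abs(residual)
        if (a <= epsilon) 0.5 * residual * residual  // least-squares region
        else epsilon * (a - 0.5 * epsilon)           // linear (robust) region
      }
    }
    ```
    
    With the default epsilon = 1.35 mentioned in the diff, small residuals are penalized like least squares while large residuals grow only linearly.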


