Date: Wed, 27 Jul 2016 09:00:29 +0000 (UTC)
From: "Nick Pentreath (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.12957192.1460114794000.143920.1469610029091@Atlassian.JIRA>
In-Reply-To: <JIRA.12957192.1460114794000@Atlassian.JIRA>
References: <JIRA.12957192.1460114794000@Atlassian.JIRA> <JIRA.12957192.1460114794398@arcas>
Subject: [jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN
 for ALS in Spark ml
archived-at: Wed, 27 Jul 2016 09:00:31 -0000


    [ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395275#comment-15395275 ]

Nick Pentreath commented on SPARK-14489:
----------------------------------------

Thanks for the thoughts, Krishna.

# Initially I also thought a flag to ignore NaN in the evaluators would make sense. Frankly, though, I have never seen (and can't think of) a situation where this is desirable _outside_ of this one, where splitting the dataset can result in user/item ids the model has not been trained on (this applies in general to "ranking" cases). For all other typical supervised learning cases, NaN means either (a) NaN inputs, which the user should deal with in the pipeline before training, or (b) a model that has bad coefficients. In both of these cases, I'd argue that it is correct to return NaN, and not desirable to ignore it;
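
To make the distinction concrete, here is a minimal plain-Scala sketch (no Spark involved). The object and method names are made up for illustration and are not an existing or proposed evaluator API. It shows that a single NaN prediction makes RMSE NaN, and what a "drop NaN" variant would compute instead, along with how many rows it silently discarded.

```scala
// Hypothetical helper names; plain Scala collections, no Spark dependency.
object NaNRmseSketch {
  /** RMSE over (prediction, label) pairs; any NaN prediction makes the result NaN. */
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    val squaredErrors = pairs.map { case (p, l) => (p - l) * (p - l) }
    math.sqrt(squaredErrors.sum / pairs.size)
  }

  /** Same metric, but drops pairs whose prediction is NaN and reports how many were dropped. */
  def rmseDroppingNaN(pairs: Seq[(Double, Double)]): (Double, Int) = {
    val (kept, dropped) = pairs.partition { case (p, _) => !p.isNaN }
    (rmse(kept), dropped.size)
  }

  def main(args: Array[String]): Unit = {
    // The middle row stands in for a user/item id the model never saw in training.
    val scored = Seq((4.0, 5.0), (Double.NaN, 3.0), (2.0, 2.0))
    println(rmse(scored))            // NaN propagates through the sum
    println(rmseDroppingNaN(scored)) // finite metric plus the number of dropped rows
  }
}
```

The dropped-row count is the important part: in cases (a) and (b) above it would hide a real problem, which is why propagating NaN seems the right default there.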

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon
>              Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.
> The cause is in CrossValidator.scala, line 109: the K folds are randomly generated. For large and sparse datasets, there is a significant probability that at least one user in the validation set is missing from the training set, so the transform method produces a few NaN predictions and the RegressionEvaluator's metrics are NaN too.
> Suggested fix: remove the NaN values when computing the rmse or other metrics (i.e., remove users or items in the validation set that are missing from the training set), and log when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
>     val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
>     splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>       val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>       val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>       // multi-model training
>       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>       trainingDataset.unpersist()
>       var i = 0
>       while (i < numModels) {
>         // TODO: duplicate evaluator to take extra params from input
>         val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>         logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>         metrics(i) += metric
>         i += 1
>       }
>       validationDataset.unpersist()
>     }
> {code}
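
The failure mode described in the issue can be reproduced without Spark. The sketch below is only an illustration under stated assumptions: `Rating`, `kFold` and `unseenUsers` are hypothetical names, plain Scala collections stand in for the DataFrames, and the folds are assigned deterministically by row index, whereas `MLUtils.kFold` splits randomly. It lists, per fold, the validation users that never appear in the training rows, which are the users an ALS model would have no factors for.

```scala
// Hypothetical names throughout; plain Scala collections stand in for the DataFrames.
object ColdStartFoldSketch {
  final case class Rating(user: Int, item: Int, rating: Double)

  /** Users that appear in the validation fold but in no training row. */
  def unseenUsers(training: Seq[Rating], validation: Seq[Rating]): Set[Int] =
    validation.map(_.user).toSet -- training.map(_.user).toSet

  /** Deterministic k-fold by row index: row i is validated in fold i % k. */
  def kFold(data: Seq[Rating], k: Int): Seq[(Seq[Rating], Seq[Rating])] = {
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      val (validation, training) = indexed.partition { case (_, i) => i % k == fold }
      (training.map(_._1), validation.map(_._1))
    }
  }

  def main(args: Array[String]): Unit = {
    // A tiny, sparse ratings set: user 3 has a single rating.
    val ratings = Seq(
      Rating(1, 10, 4.0), Rating(2, 10, 3.0), Rating(1, 11, 5.0),
      Rating(3, 12, 2.0), Rating(2, 11, 1.0), Rating(1, 12, 3.5))
    kFold(ratings, 3).zipWithIndex.foreach { case ((training, validation), splitIndex) =>
      println(s"split $splitIndex: unseen users = ${unseenUsers(training, validation)}")
    }
  }
}
```

Any fold with a non-empty set here corresponds to a validation split where the transform step would emit NaN predictions for those users.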

