Date: Wed, 27 Jul 2016 09:00:29 +0000 (UTC)
From: "Nick Pentreath (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395275#comment-15395275 ]

Nick Pentreath commented on SPARK-14489:
----------------------------------------

Thanks for the thoughts Krishna.
# Initially I also thought a flag to ignore NaN in the evaluators would make sense. However, frankly, I have never seen (and I can't think of) a situation where this is desirable, _outside_ of this situation where splitting the dataset can result in user/item ids the model has not been trained on (this applies in general to "ranking" cases). But for all other typical supervised learning cases, NaN means either (a) NaN inputs, in which case that should be dealt with by the user in the pipeline before training; or (b) a model that has bad coefficients. In both these cases, I'd argue that it is correct to return NaN, and not desirable to ignore NaN;

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon
>              Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K-folds are randomly generated. For large and sparse datasets, there is a significant probability that at least one user of the validation set is missing in the training set, hence generating a few NaN estimations with the transform method, and NaN RegressionEvaluator metrics too.
> Suggestion to fix the bug: remove the NaN values while computing the rmse or other metrics (i.e., removing users or items in the validation set that are missing in the learning set). Send logs when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
>     val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
>     splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>       val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>       val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>       // multi-model training
>       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>       trainingDataset.unpersist()
>       var i = 0
>       while (i < numModels) {
>         // TODO: duplicate evaluator to take extra params from input
>         val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>         logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>         metrics(i) += metric
>         i += 1
>       }
>       validationDataset.unpersist()
>     }
> {code}
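
For illustration only, the reporter's suggestion quoted above (drop NaN predictions before computing the metric, and report when that happens) might be sketched in plain Scala as follows. This is a standalone sketch, not Spark's RegressionEvaluator and not a proposed patch: the object `NaNAwareRmse` and the method `rmseIgnoringNaN` are hypothetical names, and the input is assumed to be an in-memory sequence of (prediction, label) pairs rather than a DataFrame.

```scala
// Hypothetical helper, not part of Spark: RMSE over (prediction, label) pairs,
// skipping pairs whose prediction is NaN and reporting how many were skipped.
object NaNAwareRmse {

  /** Returns (rmse, skippedCount). The rmse is NaN when no valid pair remains. */
  def rmseIgnoringNaN(pairs: Seq[(Double, Double)]): (Double, Int) = {
    // Keep only the pairs that carry a usable prediction.
    val valid = pairs.filterNot { case (prediction, _) => prediction.isNaN }
    val skipped = pairs.size - valid.size
    if (valid.isEmpty) {
      // Nothing left to evaluate, so the metric is still undefined.
      (Double.NaN, skipped)
    } else {
      val sumSquaredError = valid.map { case (p, l) => (p - l) * (p - l) }.sum
      (math.sqrt(sumSquaredError / valid.size), skipped)
    }
  }

  def main(args: Array[String]): Unit = {
    // The second pair stands in for a user or item unseen during training.
    val pairs = Seq((3.0, 4.0), (Double.NaN, 2.0), (5.0, 5.0))
    val (rmse, skipped) = rmseIgnoringNaN(pairs)
    println(s"rmse=$rmse, skipped $skipped NaN prediction(s)")
  }
}
```

Returning the skipped count alongside the metric keeps the dropped rows visible to the caller, which speaks to the concern in the comment above that silently ignoring NaN could hide NaN inputs or a model with bad coefficients.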