spark-issues mailing list archives

From "Max Moroz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16834) TrainValidationSplit and direct evaluation produce different scores
Date Sun, 11 Sep 2016 08:24:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15481341#comment-15481341 ]

Max Moroz commented on SPARK-16834:
-----------------------------------

[~bryanc] thanks for looking into this. I have no doubt that my code could be modified to yield
the same result as TrainValidationSplit by making it (algorithmically) identical. But I expected
it to behave nearly identically as it stands, without modification.

In each iteration the two metrics should of course differ, but they should be random numbers
drawn from two identical distributions. Unfortunately, that is not the case.

I was thinking maybe there's a problem with the permutation of the data (but there doesn't seem
to be); or perhaps TVS, unlike randomSplit, produces precisely sized slices (which I doubt).
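
To be concrete, here is a minimal sketch of what "algorithmically identical" would look like inside the loop from the code below: it reuses the same train DataFrame and the same 0.5 ratio instead of re-splitting the full dataset (train and model refer to the variables from that loop).

{code}
# Sketch only: assumes the loop body from the snippet below, where `train` is
# the 0.8 split passed to TrainValidationSplit and `model` is the fitted
# TrainValidationSplitModel.
import pyspark.ml.regression
import pyspark.ml.evaluation

sub_train, sub_val = train.randomSplit([0.5, 0.5])  # same ratio as trainRatio=0.5
lr = pyspark.ml.regression.LinearRegression()
evaluator = pyspark.ml.evaluation.RegressionEvaluator()
lrModel = lr.fit(sub_train)
manual_metric = evaluator.evaluate(lrModel.transform(sub_val))
print(model.validationMetrics[0] < manual_metric)
{code}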

> TrainValidationSplit and direct evaluation produce different scores
> --------------------------------------------------------------------
>
>                 Key: SPARK-16834
>                 URL: https://issues.apache.org/jira/browse/SPARK-16834
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>
> The two segments of code below are supposed to do the same thing: one uses TrainValidationSplit,
> the other performs the same evaluation manually. However, their results are statistically
> different (in my case, in a loop of 20 iterations, I regularly get ~19 True values).
> Unfortunately, I didn't find the bug in the source code.
> {code}
> # imports needed to run this snippet; assumes an active SparkSession `spark`
> import pyspark.ml.regression
> import pyspark.ml.tuning
> import pyspark.ml.evaluation
> from pyspark.ml.linalg import Vectors
> dataset = spark.createDataFrame(
>   [(Vectors.dense([0.0]), 0.0),
>    (Vectors.dense([0.4]), 1.0),
>    (Vectors.dense([0.5]), 0.0),
>    (Vectors.dense([0.6]), 1.0),
>    (Vectors.dense([1.0]), 1.0)] * 1000,
>   ["features", "label"]).cache()
> paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()
> # note that test is NEVER used in this code
> # I create it only to utilize randomSplit
> for i in range(20):
>   train, test = dataset.randomSplit([0.8, 0.2])
>   tvs = pyspark.ml.tuning.TrainValidationSplit(
>       estimator=pyspark.ml.regression.LinearRegression(),
>       estimatorParamMaps=paramGrid,
>       evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
>       trainRatio=0.5)
>   model = tvs.fit(train)
>   train, val, test = dataset.randomSplit([0.4, 0.4, 0.2])
>   lr = pyspark.ml.regression.LinearRegression()
>   evaluator = pyspark.ml.evaluation.RegressionEvaluator()
>   lrModel = lr.fit(train)
>   predicted = lrModel.transform(val)
>   print(model.validationMetrics[0] < evaluator.evaluate(predicted))
> {code}
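
For reference, a rough sketch of what a TrainValidationSplit fit with an empty param grid and trainRatio=0.5 is generally expected to do (an illustrative re-implementation for reasoning about the comparison, not Spark's actual code; the helper name is made up):

{code}
# Illustrative sketch only, not Spark's implementation.
# trainRatio=0.5 applied to `train` (itself ~0.8 of the data) leaves roughly
# 0.4/0.4 of the full dataset for fitting/validation, which is why the manual
# 0.4/0.4/0.2 split above is expected to yield a statistically comparable metric.
def tvs_fit_sketch(estimator, evaluator, df, trainRatio=0.5):
    tvs_train, tvs_val = df.randomSplit([trainRatio, 1.0 - trainRatio])
    fitted = estimator.fit(tvs_train)
    metric = evaluator.evaluate(fitted.transform(tvs_val))
    return fitted, metric

# Hypothetical usage, mirroring the loop above:
#   fitted, metric = tvs_fit_sketch(pyspark.ml.regression.LinearRegression(),
#                                   pyspark.ml.evaluation.RegressionEvaluator(),
#                                   train)
{code}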



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
