spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-25959) Difference in featureImportances results on computed vs saved models
Date Tue, 20 Nov 2018 14:48:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693326#comment-16693326
] 

Sean Owen commented on SPARK-25959:
-----------------------------------

Yes, 2.2 is all but EOL. I am worried about the binary incompatibility issue, and that's why
I didn't back-port. Even if the incompatibility isn't in the apparent user-visible API, I
wonder if it will cause problems at link time nonetheless. I didn't test it. Is it possible
to submit a job compiled from master against an older cluster and just check that it doesn't
fail?
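
As a rough illustration, a probe job along these lines could be compiled against master and
spark-submitted to a 2.2.x cluster; the object name and model path below are placeholders,
not anything from the patch. A binary incompatibility would surface as a linkage error
(e.g. NoSuchMethodError) as soon as the model code paths are exercised:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.GBTClassificationModel

object CompatProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CompatProbe").getOrCreate()
    // Placeholder path: any previously saved GBT model would do for this check.
    val model = GBTClassificationModel.load("file:///tmp/test123")
    // Touch the tree-model code paths so any incompatibility shows up at link time.
    println(model.featureImportances)
    spark.stop()
  }
}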

> Difference in featureImportances results on computed vs saved models
> --------------------------------------------------------------------
>
>                 Key: SPARK-25959
>                 URL: https://issues.apache.org/jira/browse/SPARK-25959
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.2.0
>            Reporter: Suraj Nayak
>            Assignee: Marco Gaido
>            Priority: Major
>             Fix For: 3.0.0
>
>
> I tried to implement GBT and found that the feature importance computed when the model
> was fit is different from the feature importance of the same model after it was saved to
> storage and loaded back.
>  
> I also found that once the persisted model is loaded, saved again, and loaded back, the
> feature importance remains the same.
>  
> Not sure if this is a bug when storing and reading the model the first time, or whether I
> am missing some parameter that needs to be set before saving the model (so the model is
> picking up some defaults, causing the feature importance to change).
>  
> *Below is the test code:*
> import spark.implicits._ // for toDF; assumes a SparkSession named spark (e.g. spark-shell)
> import org.apache.spark.ml.feature.VectorAssembler
> import org.apache.spark.ml.classification.{GBTClassifier, GBTClassificationModel}
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Print feature importances from the freshly fitted model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
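
A short follow-up sketch, not from the original report, showing how the printed comparison
above could be turned into an automatic check; it reuses the gbt and gbtload values defined
in the reporter's code:

// Compare the in-memory and reloaded importances element-wise.
val before = gbt.featureImportances.toArray
val after  = gbtload.featureImportances.toArray
val maxDiff = before.zip(after).map { case (b, a) => math.abs(b - a) }.max
// On an affected build this assertion fails (maxDiff is roughly 0.05 in the output above).
assert(maxDiff < 1e-6, s"feature importances changed after save/load, max diff = $maxDiff")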



