spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16750) ML GaussianMixture training failed due to feature column type mistake
Date Fri, 05 Aug 2016 16:09:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409632#comment-15409632 ]

Apache Spark commented on SPARK-16750:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/14455

> ML GaussianMixture training failed due to feature column type mistake
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16750
>                 URL: https://issues.apache.org/jira/browse/SPARK-16750
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>             Fix For: 2.0.1, 2.1.0
>
>
> ML GaussianMixture training fails because of a feature column type mistake: the feature
> column should be of type {{ml.linalg.VectorUDT}}, but the schema check requires
> {{mllib.linalg.VectorUDT}} by mistake.
> The bug is easy to reproduce with the following code:
> {code}
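> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.clustering.GaussianMixture
> import org.apache.spark.ml.feature.MinMaxScaler
> import org.apache.spark.ml.linalg.Vectors
>
> // Assumes a SparkSession named `spark`, as predefined in spark-shell.
> val df = spark.createDataFrame(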
>   Seq(
>     (1, Vectors.dense(0.0, 1.0, 4.0)),
>     (2, Vectors.dense(1.0, 0.0, 4.0)),
>     (3, Vectors.dense(1.0, 0.0, 5.0)),
>     (4, Vectors.dense(0.0, 0.0, 5.0)))
> ).toDF("id", "features")
> val scaler = new MinMaxScaler()
>   .setInputCol("features")
>   .setOutputCol("features_scaled")
>   .setMin(0.0)
>   .setMax(5.0)
> val gmm = new GaussianMixture()
>   .setFeaturesCol("features_scaled")
>   .setK(2)
> val pipeline = new Pipeline().setStages(Array(scaler, gmm))
> pipeline.fit(df)
> requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
> java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
> 	at scala.Predef$.require(Predef.scala:224)
> 	at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
> 	at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
> 	at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
> 	at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
> 	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
> 	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
> 	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> 	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> 	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
> 	at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
> 	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
> 	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
> {code}
> Why did the unit tests not catch this error? Because some estimators/transformers do not
> call {{transformSchema(dataset.schema)}} first during {{fit}} or {{transform}}.
> I will also add this call to all estimators/transformers that are missing it.
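
To illustrate why a direct call to fit can hide a wrong schema check, here is a minimal Spark-free sketch. Every name in it (Schema, Stage, BuggyClusterer, fitWithoutCheck, fitWithCheck) is hypothetical and stands in for the real Spark classes; it is not the code from the pull request. It only models the idea from the report: the stack trace shows the failure coming from Pipeline.fit via transformSchema, so a stage whose own fit never validates the schema would not trip over a wrong expected type when tested on its own.

```scala
// Spark-free sketch: all types below are hypothetical stand-ins, not Spark classes.
object SchemaCheckSketch {
  // A schema maps column names to a type tag such as "ml.VectorUDT".
  type Schema = Map[String, String]

  trait Stage {
    // Validates the input schema and returns the output schema.
    def transformSchema(schema: Schema): Schema
  }

  // Expects the wrong (old mllib) vector type, mirroring the mistake in the report.
  class BuggyClusterer(featuresCol: String) extends Stage {
    def transformSchema(schema: Schema): Schema = {
      val actual = schema(featuresCol)
      require(actual == "mllib.VectorUDT",
        s"Column $featuresCol must be of type mllib.VectorUDT but was actually $actual.")
      schema + ("prediction" -> "int")
    }

    // Skips schema validation, so calling it directly hides the mistake.
    def fitWithoutCheck(schema: Schema): String = "model"

    // Validates the schema first, so the same mistake fails fast.
    def fitWithCheck(schema: Schema): String = {
      transformSchema(schema)
      "model"
    }
  }

  def main(args: Array[String]): Unit = {
    val schema: Schema = Map("features_scaled" -> "ml.VectorUDT")
    val stage = new BuggyClusterer("features_scaled")

    // A test that calls fit directly passes despite the wrong expected type.
    assert(stage.fitWithoutCheck(schema) == "model")

    // With validation inside fit, the mismatch surfaces as IllegalArgumentException.
    val failed =
      try { stage.fitWithCheck(schema); false } catch { case _: IllegalArgumentException => true }
    assert(failed)
  }
}
```

In this sketch the only difference between the two fit variants is the leading transformSchema call, which is the kind of call the report proposes adding to the estimators/transformers that lack it.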



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

