Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Mon, 25 Sep 2017 23:19:00 +0000 (UTC)
From: "Bryan Cutler (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13102730.1505528397000.198999.1506381540534@Atlassian.JIRA>
In-Reply-To: <JIRA.13102730.1505528397000@Atlassian.JIRA>
References: <JIRA.13102730.1505528397000@Atlassian.JIRA> <JIRA.13102730.1505528397444@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (SPARK-22034) CrossValidator's training and
 testing set with different set of labels, resulting in encoder transform
 error
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Mon, 25 Sep 2017 23:19:07 -0000


    [ https://issues.apache.org/jira/browse/SPARK-22034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179938#comment-16179938 ] 

Bryan Cutler edited comment on SPARK-22034 at 9/25/17 11:18 PM:
----------------------------------------------------------------

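You would normally fit the VectorIndexer on the entire dataset and then put the resulting transformer (the fitted VectorIndexerModel) in the pipeline for cross validation, so that its category maps cover every value that can appear in any fold. Unless I'm mistaken, this is not a bug.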

For example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala#L52
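
A minimal sketch of that approach, in the style of the linked example. The object name, the libsvm sample path, the parameter grid, and the choice of RandomForestClassifier are assumptions for illustration; maxCategories = 13 is taken from the issue description. The point is only that fit() is called on the full dataset before the pipeline is built, so CrossValidator never refits the indexer on a single fold.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; not part of the Spark code base.
object PrefitIndexerCrossValidation {

  // Fit the indexer once on the full dataset, then cross-validate a pipeline
  // that contains the already-fitted model instead of the estimator.
  def crossValidate(data: DataFrame): CrossValidatorModel = {
    // fit() returns a VectorIndexerModel whose category maps are built from
    // every row of `data`, so no fold can contain an unseen category.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(13) // value from the issue description
      .fit(data)

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("indexedFeatures")

    // The fitted model is a Transformer, so fitting the pipeline on a fold
    // only applies it; the category maps are not recomputed per fold.
    val pipeline = new Pipeline().setStages(Array(featureIndexer, rf))

    val paramGrid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(10, 20)) // illustrative grid only
      .build()

    new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new MulticlassClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
      .fit(data)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PrefitIndexerCrossValidation").getOrCreate()
    // Assumed input: the sample file shipped with the Spark distribution.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
    val cvModel = crossValidate(data)
    println(cvModel.avgMetrics.mkString(", "))
    spark.stop()
  }
}
```

Fitting the indexer on all rows means the held-out folds contribute to the category maps, though not to the classifier itself; whether that is acceptable depends on the use case.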



> CrossValidator's training and testing set with different set of labels, resulting in encoder transform error
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22034
>                 URL: https://issues.apache.org/jira/browse/SPARK-22034
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.2.0
>         Environment: Ubuntu 16.04
> Scala 2.11
> Spark 2.2.0
>            Reporter: AnChe Kuo
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Let's say we have a VectorIndexer with maxCategories set to 13, and the training set has a column containing a month label.
> In CrossValidator, the dataframe is split into training and testing sets automatically. It could happen that the training set lacks month 2 (by chance, or quite frequently if the labels are unbalanced).
> When a model is trained within the cross validator, the pipeline is fitted on the training set only, resulting in a partial key map in VectorIndexer. When this pipeline is then used to transform the testing set, VectorIndexer will throw a "key not found" error.
> Making CrossValidator an estimator as well, so that it can be connected to a whole pipeline, is a cool idea, but a bug like this then occurs, and it is not expected.
> The solution, I am guessing, would be to check each stage in the pipeline and, when we see an encoder-type stage, fit that stage's model on the complete dataset.
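
The failure mode described above can be reproduced without Spark. The sketch below is a simplified stand-in, not VectorIndexer's actual implementation: fitCategoryMap and the object name are hypothetical, and the month values are made up. It only shows that a category map built from a fold missing month 2 cannot index a row that contains month 2, while a map built from the complete data can.

```scala
// Hypothetical stand-in for an indexer; not Spark code.
object PartialKeyMapDemo {

  // Map each distinct category seen during fitting to a dense index,
  // assigned in sorted order.
  def fitCategoryMap(values: Seq[Int]): Map[Int, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    // Made-up month column in which month 2 occurs only once.
    val months = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 3, 5)

    // Simulate a split where the single month-2 row lands in the test fold.
    val (test, train) = months.partition(_ == 2)

    val partialMap = fitCategoryMap(train) // has no entry for month 2
    val fullMap = fitCategoryMap(months)   // has an entry for every month

    // Map.apply throws NoSuchElementException for a missing key, which is
    // the kind of "key not found" error reported in the issue.
    val failed = scala.util.Try(test.map(partialMap)).isFailure
    println(s"map fitted on training fold fails on test fold: $failed")
    println(s"map fitted on full data indexes test fold as: ${test.map(fullMap)}")
  }
}
```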

