spark-issues mailing list archives

From "Vincent (JIRA)" <>
Subject [jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
Date Mon, 15 Aug 2016 11:44:20 GMT


Vincent commented on SPARK-17055:

One of the most common tasks is to fit a model to a set of training data so that it can make
reliable predictions on unseen data. labelKFold can be used to test the model's ability to
generalize by evaluating its performance on a class of data held out from training, which is
assumed to approximate the typical unseen data the model will encounter.
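To make the idea concrete, here is a minimal, hypothetical sketch of label-aware k-fold splitting (not Spark's actual API; the greedy balancing heuristic and function name are illustrative assumptions): every sample sharing a label is assigned to the same fold, so no label can appear in both the training and validation sets.

```python
# Hypothetical sketch of label-aware k-fold splitting; not Spark's API.
from collections import Counter

def label_k_fold(labels, k):
    """Return k folds of sample indices; each label lands in exactly one fold."""
    counts = Counter(labels)
    fold_sizes = [0] * k
    fold_of = {}
    # Greedily place the largest label groups into the currently smallest
    # fold to keep fold sizes roughly balanced.
    for label, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        i = fold_sizes.index(min(fold_sizes))
        fold_of[label] = i
        fold_sizes[i] += count
    folds = [[] for _ in range(k)]
    for idx, label in enumerate(labels):
        folds[fold_of[label]].append(idx)
    return folds

# Samples tagged by the subject ("label") they were collected from.
labels = ["s1", "s1", "s2", "s2", "s3", "s3", "s4"]
folds = label_k_fold(labels, 2)
# Each fold serves once as the validation set; the rest is training data.
for test_fold in folds:
    train = [i for i in range(len(labels)) if i not in test_fold]
    # No subject's samples are split across training and validation.
    assert not {labels[i] for i in test_fold} & {labels[i] for i in train}
```

Plain k-fold, by contrast, shuffles indices without regard to labels, so samples from the same subject routinely end up on both sides of the split.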

> add labelKFold to CrossValidator
> --------------------------------
>                 Key: SPARK-17055
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Vincent
>            Priority: Minor
> The current CrossValidator only supports k-fold, which randomly divides all the samples
into k groups. But when data is gathered from different subjects and we want to avoid
over-fitting, we want to hold out all samples with certain labels from the training data
and put them into a validation fold, i.e. we want to ensure that the same label does not
appear in both the training and test sets.
> Mainstream packages like Sklearn already support such a cross-validation method.
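For reference, a small sketch of the scikit-learn method the reporter points to: it was named LabelKFold in sklearn 0.17 and renamed GroupKFold in 0.18 (the sample data below is illustrative).

```python
# Sketch of sklearn's GroupKFold (formerly LabelKFold): each distinct
# group appears in exactly one test fold, never in both train and test.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(16).reshape(8, 2)               # 8 samples, 2 features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])        # class labels (not used by the split)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # subject each sample came from

gkf = GroupKFold(n_splits=2)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    assert not (train_groups & test_groups)   # no subject leaks across the split
```

A CrossValidator counterpart would need an analogous group/label column to drive the fold assignment instead of a purely random split.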

This message was sent by Atlassian JIRA

