Mailing-List: contact issues-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@flink.apache.org
Date: Wed, 8 Jul 2015 09:23:04 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)" <jira@apache.org>
To: issues@flink.apache.org
Message-ID: <JIRA.12782824.1426676498000.127982.1436347384564@Atlassian.JIRA>
In-Reply-To: <JIRA.12782824.1426676498000@Atlassian.JIRA>
References: <JIRA.12782824.1426676498000@Atlassian.JIRA>
 <JIRA.12782824.1426676498040@arcas>
Subject: [jira] [Commented] (FLINK-1723) Add cross validation for model
 evaluation
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/FLINK-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618274#comment-14618274 ] 

ASF GitHub Bot commented on FLINK-1723:
---------------------------------------

GitHub user thvasilo opened a pull request:

    https://github.com/apache/flink/pull/891

    [FLINK-1723] [ml] [WIP] Add cross validation for model evaluation

    Cross validation (CV) [1] is a standard tool to estimate the test error for a model. As such it is a crucial tool for every machine learning library.
    
    This builds upon the ongoing work on the evaluation framework for FlinkML.
    As such, the current version only supports calculating the score of Predictors; the end goal is to support CV for Estimators as well, to cover the unsupervised learning case.
    
    We are using some code from the Apache Spark project, mostly simple routines for probabilistic sampling of datasets and generation of KFold CV data.
    
    More and better tests need to be added to the implementation, and the current sampling approaches probably will not work if used within an iteration.
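
The fold generation mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the FlinkML code from this pull request: the function name `k_fold`, its parameters, and the use of an in-memory list are assumptions made for the example. It follows the filter-based idea of giving each element one uniform random draw and placing it in the validation set of the fold whose interval contains that draw.

```python
import random


def k_fold(data, k, seed=0):
    """Yield (training, validation) list pairs for k folds.

    Hypothetical helper for illustration only. `data` must be a list,
    because it is traversed once per fold.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = random.Random(seed)
    # One uniform draw in [0, 1) per element, reused for every fold, so that
    # the k validation sets are disjoint and together cover the whole input.
    draws = [rng.random() for _ in data]
    for i in range(k):
        lower, upper = i / k, (i + 1) / k
        # Both sets are plain filters over the same input.
        validation = [x for x, u in zip(data, draws) if lower <= u < upper]
        training = [x for x, u in zip(data, draws) if not (lower <= u < upper)]
        yield training, validation


if __name__ == "__main__":
    for train, val in k_fold(list(range(20)), k=4, seed=1):
        print(len(train), len(val))
```

Because membership is decided by a random draw, the folds are only approximately equal in size. Reusing the same draws for every fold is what makes the validation sets disjoint.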

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thvasilo/flink cross-validation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/891.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #891
    
----
commit 305b43a451af3d8bc859671476c215308fbfc7fc
Author: mikiobraun <mikiobraun@gmail.com>
Date:   2015-06-22T15:04:42Z

    Adding some first loss functions for the evaluation framework

commit bdb1a6912d2bcec29446ca4a9fbc550f2ecb8f4a
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-06-23T14:07:48Z

    Scorer for evaluation

commit 4a7593ade68f43d444a6b289191f053a4ea8b031
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-06-25T09:41:10Z

    Adds accuracy score and R^2 score. Also trying out Scores as classes instead of functions.
    
    Not too happy with the extra boilerplate of Scores as classes; will probably revert
    and have objects like RegressionsScores, ClassificationScores that contain the definitions
    of the relevant scores.

commit 5c89c478bd00f168bfe48954d06367b28f948571
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-06-26T11:30:56Z

    Adds an evaluate operation for LabeledVector input

commit e7bb4b42424641d640df370cd6ace71f7f42ee8d
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-06-26T11:32:13Z

    Adds Regressor interface, and a score function for regression algorithms.

commit 3d8a6928b02b30c732f282df61613561dbf8d4fc
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-06-30T14:04:58Z

    Added Classifier intermediate class, and default score function for classifiers.

commit e1a26ed30bb784633685703892f67d51136f6060
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-01T08:20:41Z

    Going back to having scores defined in objects instead of their own classes.

commit 0dd251a5a59cd610c4df3e9a1ea3921b1a9cc2e0
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-01T13:00:37Z

    Removed ParameterMap from predict function of PredictOperation

commit 492e9a383af6285f0fdca5031d2bd7bdfe3cd511
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-02T10:21:28Z

    Reworked score functionality to allow chained Predictors.
    
    All predictors must now implement a calculateScore function.
    We are assuming for now that predictors are supervised learning algorithms;
    once unsupervised learning algorithms are added, this will need to be reworked.
    
    Also added an evaluate dataset operation to ALS, to allow for scoring of the
    algorithm. Default performance measure for ALS is RMSE.

commit d9715ed3a6faba78e0b34368425768e826d5a736
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-06T08:50:59Z

    Made calculateScore only take DataSet[(Double, Double)]

commit 4983c47917c2776a856271dd5ae62b2b3735c466
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-07T08:15:58Z

    Added test for DataSet.mean()

commit 250a754797869772041e8cb65e3a9498ae9244d0
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-07T09:18:40Z

    Added simple sampling algorithms, using filter()

commit 2a3de8866d3beefbb4f188494024aba96d219f97
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-07T10:10:33Z

    Added KFold splitting

commit 1febc843b38cc1b727a45c35da2eb8f1684592e6
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-07T10:39:34Z

    Made KFold into a class, added folds class parameter

commit 85f8ed0dde61cace3cbe3757e6645a999b56ebc5
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-07T12:11:45Z

    Switched from cross to mapWithBcVariable

commit 44d9251ecc965bf7d2bb40ffdf2653c99750af12
Author: Theodore Vasiloudis <tvas@sics.se>
Date:   2015-07-08T09:11:22Z

    Added crossValScore function to compute the cross-validated score for a predictor.

----
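
The commits above add an R^2 score, a score calculation over (truth, prediction) pairs, and a function that computes the cross-validated score for a predictor. The sketch below shows how these pieces could fit together. It is plain Python and not the FlinkML API: the names `r2_score`, `cross_val_score`, and `fit_line` are hypothetical, and the index-based fold assignment is a simplification chosen to keep the example deterministic.

```python
from statistics import mean


def r2_score(pairs):
    """Coefficient of determination from (truth, prediction) pairs.

    Raises ZeroDivisionError if all truth values are identical.
    """
    pairs = list(pairs)
    mean_truth = mean(t for t, _ in pairs)
    ss_res = sum((t - p) ** 2 for t, p in pairs)
    ss_tot = sum((t - mean_truth) ** 2 for t, _ in pairs)
    return 1.0 - ss_res / ss_tot


def cross_val_score(fit, data, k, score=r2_score):
    """Return one score per fold.

    `data` is a list of (feature, label) pairs. `fit(training)` must return
    a function that maps a feature to a predicted label.
    """
    scores = []
    for i in range(k):
        # Simplified split: element j is validated in fold j % k.
        training = [d for j, d in enumerate(data) if j % k != i]
        validation = [d for j, d in enumerate(data) if j % k == i]
        predict = fit(training)
        scores.append(score([(y, predict(x)) for x, y in validation]))
    return scores


def fit_line(training):
    """Hypothetical predictor: least-squares line through one feature."""
    xs = [x for x, _ in training]
    ys = [y for _, y in training]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in training) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


if __name__ == "__main__":
    samples = [(float(x), 2.0 * x + 1.0) for x in range(20)]
    print(cross_val_score(fit_line, samples, k=4))
```

Passing the score as a function of (truth, prediction) pairs keeps the cross-validation loop independent of the metric, so an accuracy or RMSE function could be swapped in without changing `cross_val_score`.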


> Add cross validation for model evaluation
> -----------------------------------------
>
>                 Key: FLINK-1723
>                 URL: https://issues.apache.org/jira/browse/FLINK-1723
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Theodore Vasiloudis
>              Labels: ML
>
> Cross validation [1] is a standard tool to estimate the test error for a model. As such it is a crucial tool for every machine learning library.
> The cross validation should work with arbitrary Estimators and error metrics. The first strategy it should support is k-fold cross validation.
> Resources:
> [1] [http://en.wikipedia.org/wiki/Cross-validation]

