Date: Wed, 8 Jul 2015 09:23:04 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Commented] (FLINK-1723) Add cross validation for model evaluation

[ https://issues.apache.org/jira/browse/FLINK-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618274#comment-14618274 ]

ASF GitHub Bot commented on FLINK-1723:
---------------------------------------

GitHub user thvasilo opened a pull request:

https://github.com/apache/flink/pull/891

[FLINK-1723] [ml] [WIP] Add cross validation for model evaluation

Cross validation (CV) [1] is a standard tool to estimate the test error for a model. As such it is a crucial tool for every machine learning library.

This builds upon the ongoing work on the evaluation framework for FlinkML.
As such, the current version supports calculating the score of Predictors only; however, the end goal is to have CV for Estimators as well, to cover the unsupervised learning case.

We are using some code from the Apache Spark project, mostly simple routines for probabilistic sampling of datasets and generation of KFold CV data.

More and better tests need to be added to the implementation, and the current sampling approaches probably will not work if used within an iteration.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thvasilo/flink cross-validation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/891.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #891

----
commit 305b43a451af3d8bc859671476c215308fbfc7fc
Author: mikiobraun
Date:   2015-06-22T15:04:42Z

    Adding some first loss functions for the evaluation framework

commit bdb1a6912d2bcec29446ca4a9fbc550f2ecb8f4a
Author: Theodore Vasiloudis
Date:   2015-06-23T14:07:48Z

    Scorer for evaluation

commit 4a7593ade68f43d444a6b289191f053a4ea8b031
Author: Theodore Vasiloudis
Date:   2015-06-25T09:41:10Z

    Adds accuracy score and R^2 score. Also trying out Scores as classes instead of functions.

    Not too happy with the extra boilerplate of Scores as classes; will probably revert
    and have objects like RegressionsScores, ClassificationScores that contain the definitions
    of the relevant scores.

commit 5c89c478bd00f168bfe48954d06367b28f948571
Author: Theodore Vasiloudis
Date:   2015-06-26T11:30:56Z

    Adds an evaluate operation for LabeledVector input

commit e7bb4b42424641d640df370cd6ace71f7f42ee8d
Author: Theodore Vasiloudis
Date:   2015-06-26T11:32:13Z

    Adds Regressor interface, and a score function for regression algorithms.

commit 3d8a6928b02b30c732f282df61613561dbf8d4fc
Author: Theodore Vasiloudis
Date:   2015-06-30T14:04:58Z

    Added Classifier intermediate class, and default score function for classifiers.

commit e1a26ed30bb784633685703892f67d51136f6060
Author: Theodore Vasiloudis
Date:   2015-07-01T08:20:41Z

    Going back to having scores defined in objects instead of their own classes.

commit 0dd251a5a59cd610c4df3e9a1ea3921b1a9cc2e0
Author: Theodore Vasiloudis
Date:   2015-07-01T13:00:37Z

    Removed ParameterMap from predict function of PredictOperation

commit 492e9a383af6285f0fdca5031d2bd7bdfe3cd511
Author: Theodore Vasiloudis
Date:   2015-07-02T10:21:28Z

    Reworked score functionality to allow chained Predictors.

    All predictors must now implement a calculateScore function.
    We are for now assuming that predictors are supervised learning algorithms;
    once unsupervised learning algorithms are added this will need to be reworked.

    Also added an evaluate dataset operation to ALS, to allow for scoring of the algorithm.
    Default performance measure for ALS is RMSE.

commit d9715ed3a6faba78e0b34368425768e826d5a736
Author: Theodore Vasiloudis
Date:   2015-07-06T08:50:59Z

    Made calculateScore only take DataSet[(Double, Double)]

commit 4983c47917c2776a856271dd5ae62b2b3735c466
Author: Theodore Vasiloudis
Date:   2015-07-07T08:15:58Z

    Added test for DataSet.mean()

commit 250a754797869772041e8cb65e3a9498ae9244d0
Author: Theodore Vasiloudis
Date:   2015-07-07T09:18:40Z

    Added simple sampling algorithms, using filter()

commit 2a3de8866d3beefbb4f188494024aba96d219f97
Author: Theodore Vasiloudis
Date:   2015-07-07T10:10:33Z

    Added KFold splitting

commit 1febc843b38cc1b727a45c35da2eb8f1684592e6
Author: Theodore Vasiloudis
Date:   2015-07-07T10:39:34Z

    Made KFold into a class, added folds class parameter

commit 85f8ed0dde61cace3cbe3757e6645a999b56ebc5
Author: Theodore Vasiloudis
Date:   2015-07-07T12:11:45Z

    Switched from cross to mapWithBcVariable

commit 44d9251ecc965bf7d2bb40ffdf2653c99750af12
Author: Theodore Vasiloudis
Date:   2015-07-08T09:11:22Z

    Added crossValScore function to compute the cross-validated score for a predictor.

----

> Add cross validation for model evaluation
> -----------------------------------------
>
>                 Key: FLINK-1723
>                 URL: https://issues.apache.org/jira/browse/FLINK-1723
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Theodore Vasiloudis
>              Labels: ML
>
> Cross validation [1] is a standard tool to estimate the test error for a model. As such it is a crucial tool for every machine learning library.
> The cross validation should work with arbitrary Estimators and error metrics. A first cross validation strategy it should support is the k-fold cross validation.
> Resources:
> [1] [http://en.wikipedia.org/wiki/Cross-validation]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
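
For readers following this thread, the k-fold procedure described in the issue and pull request above can be illustrated with a minimal sketch. It is plain Python, not the FlinkML or Spark code from the pull request. Every name in it (k_fold_splits, cross_val_score, fit_mean, rmse) is hypothetical. It also splits by shuffling and striding rather than by the probabilistic sampling the pull request mentions. The score function takes (truth, prediction) pairs, which is an assumption loosely modeled on the DataSet[(Double, Double)] input named in the commit log; the order of the pair is a guess.

```python
import math
import random


def k_fold_splits(data, folds, seed=0):
    """Yield (training, testing) pairs; each item is in exactly one test fold."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    for i in range(folds):
        # Every folds-th item, starting at offset i, forms the i-th test fold.
        testing = shuffled[i::folds]
        training = [x for j, x in enumerate(shuffled) if j % folds != i]
        yield training, testing


def rmse(pairs):
    """Root mean squared error over (truth, prediction) pairs (assumed order)."""
    pairs = list(pairs)
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def cross_val_score(fit, data, folds, score=rmse):
    """Train on each training split and score on the held-out fold."""
    scores = []
    for training, testing in k_fold_splits(data, folds):
        predict = fit(training)  # fit returns a prediction function
        scores.append(score((y, predict(x)) for x, y in testing))
    return scores


def fit_mean(training):
    """Toy predictor (hypothetical): always predicts the mean training label."""
    mean = sum(y for _, y in training) / len(training)
    return lambda x: mean


# Illustrative data only: (feature, label) pairs on the line y = 2x.
data = [(float(i), 2.0 * i) for i in range(20)]
scores = cross_val_score(fit_mean, data, folds=5)
print(len(scores), sum(scores) / len(scores))
```

The mean of the per-fold scores is the cross-validated estimate of the test error. In a distributed setting the splitting step is the part that would differ most, which is presumably why the pull request borrows sampling routines instead of materializing a shuffled list.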