flink-issues mailing list archives

From "Theodore Vasiloudis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1742) Sample data points for MultipleLinearRegression to support proper SGD
Date Thu, 16 Apr 2015 15:00:01 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498115#comment-14498115 ]

Theodore Vasiloudis commented on FLINK-1742:

The sample operator on DataSet could be used for this purpose and is more general.

> Sample data points for MultipleLinearRegression to support proper SGD
> ---------------------------------------------------------------------
>                 Key: FLINK-1742
>                 URL: https://issues.apache.org/jira/browse/FLINK-1742
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Priority: Minor
>              Labels: ML
> Currently the stochastic gradient descent method is applied to all data points of the
> {{MultipleLinearRegression}} implementation. In order to scale to huge data sets, each
> {{MultipleLinearRegression}} iteration should perform the SGD only on a random subset of
> data points. Therefore, proper data point sampling should be added to the
> {{MultipleLinearRegression}} implementation.
> An easy implementation would simply be a filter which flips a coin for each data point,
> deciding whether to take or to discard it. The downside of this approach is that the whole
> data set has to be processed. It would be beneficial if a sampling operator did not have
> to process the whole data set, given that it knows the data set's size. This assumption should
> be true for cached data sets in an iteration.
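
To make the two sampling strategies from the description concrete, here is a minimal
sketch in plain Python. It is not Flink code and does not reflect any Flink API; the
function names bernoulli_sample and index_sample, the toy data, and the fixed seed are
all assumptions made for illustration. The first function is the coin-flip filter, which
has to visit every data point. The second assumes the data set's size is known and the
data is randomly accessible (as one might hope for a cached data set), so it only touches
the points it actually selects.

```python
import random


def bernoulli_sample(points, fraction, rng):
    """Coin-flip filter: visits every point and keeps each one
    independently with probability `fraction`."""
    return [p for p in points if rng.random() < fraction]


def index_sample(points, k, rng):
    """Known-size sampling: draws k distinct indices up front and
    reads only those k points (requires len() and random access)."""
    indices = rng.sample(range(len(points)), k)
    return [points[i] for i in indices]


# Hypothetical toy data: (x, y) pairs on the line y = 2x + 1.
rng = random.Random(42)
points = [(float(x), 2.0 * x + 1.0) for x in range(1000)]

coin_flip = bernoulli_sample(points, 0.1, rng)  # sample size varies per run
by_index = index_sample(points, 100, rng)       # sample size is exactly 100
```

Note the trade-off: the coin-flip filter needs no knowledge of the data set's size but
scans everything and returns a sample whose size varies, while the index-based variant
returns exactly k points but depends on the size being known and on cheap random access.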

This message was sent by Atlassian JIRA