flink-dev mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-1742) Sample data points for MultipleLinearRegression to support proper SGD
Date Wed, 18 Mar 2015 18:41:43 GMT
Till Rohrmann created FLINK-1742:
------------------------------------

             Summary: Sample data points for MultipleLinearRegression to support proper SGD
                 Key: FLINK-1742
                 URL: https://issues.apache.org/jira/browse/FLINK-1742
             Project: Flink
          Issue Type: Improvement
          Components: Machine Learning Library
            Reporter: Till Rohrmann
            Priority: Minor


Currently, the stochastic gradient descent (SGD) method in the {{MultipleLinearRegression}}
implementation is applied to all data points in every iteration. To scale to huge data sets,
each {{MultipleLinearRegression}} iteration should perform SGD on only a random subset of the
data points. Proper data point sampling should therefore be added to the implementation.
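
To illustrate the requested behaviour, here is a minimal, self-contained Python sketch of
a gradient step that uses only a random subset of the data points in each iteration. It is
not Flink code and does not reflect the actual {{MultipleLinearRegression}} implementation;
the names sgd_step, fraction and step_size, as well as the toy data, are assumptions made
for this example.

```python
import random

def sgd_step(weights, data, fraction, step_size, rng):
    """One iteration: squared-error gradient computed over a random subset only."""
    # Keep each point independently with probability `fraction` (hypothetical parameter).
    batch = [(x, y) for (x, y) in data if rng.random() < fraction]
    if not batch:
        return weights  # nothing sampled this round; leave the weights unchanged
    grad = [0.0] * len(weights)
    for x, y in batch:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi
    # Average the gradient over the sampled points and take one step.
    return [w - step_size * g / len(batch) for w, g in zip(weights, grad)]

rng = random.Random(42)

# Toy, noise-free data generated from y = 2*x1 + 3*x2 (assumed for illustration).
data = []
for _ in range(1000):
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    data.append((x, 2.0 * x[0] + 3.0 * x[1]))

weights = [0.0, 0.0]
for _ in range(500):
    weights = sgd_step(weights, data, fraction=0.1, step_size=0.5, rng=rng)
```

With fraction=0.1, each iteration computes the gradient from roughly a tenth of the points,
although this particular sampler still scans all of them to decide which ones to keep.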

A simple implementation would be a filter that flips a coin for each data point to decide
whether to keep or discard it. The downside of this approach is that the whole data set
still has to be processed. It would be beneficial if a sampling operator did not have to
process the whole data set, provided that it knows the data set's size. This assumption
should hold for cached data sets in an iteration.
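
The difference between the two strategies can be sketched in plain Python. This is only an
illustration, not a proposal for the Flink operator: the function names are hypothetical,
and the index-based variant additionally assumes cheap random access to individual points,
which a distributed data set may not provide.

```python
import random

def bernoulli_sample(data, fraction, rng):
    """Coin flip per data point: every element has to be visited."""
    visited = 0
    sample = []
    for point in data:
        visited += 1
        if rng.random() < fraction:
            sample.append(point)
    return sample, visited

def index_sample(data, fraction, rng):
    """Known size: draw the indices first, then read only those elements."""
    k = round(fraction * len(data))
    indices = rng.sample(range(len(data)), k)  # k distinct positions
    return [data[i] for i in indices], k

rng = random.Random(0)
data = list(range(10_000))

coin_sample, coin_visited = bernoulli_sample(data, 0.01, rng)
idx_sample, idx_visited = index_sample(data, 0.01, rng)
```

Here the coin-flip filter visits all 10,000 elements, while the index-based variant reads
only the 100 it selects. Note that the two are not statistically identical: the filter
yields a sample of random size, whereas the index-based variant returns exactly k points
without replacement.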



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
