flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1901) Create sample operator for Dataset
Date Mon, 10 Aug 2015 11:11:45 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679955#comment-14679955 ]

ASF GitHub Bot commented on FLINK-1901:

Github user ChengXiangLi commented on the pull request:

    Thanks for the input, @tillrohrmann and @sachingoel0101. I would like to implement
fixed-size sampling with only one pass through the source dataset. When a user samples
a dataset, it is quite large in most cases, so passing through it multiple times would
add considerable cost. In my solution, the basic idea of fixed-size sampling in a distributed
stream is this: generate a random number for each input element as its weight, then select
the top K elements with the largest weights. Because the weights are generated randomly,
the selected top K elements form a random sample. You can find more details in the code and Javadoc.
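
As a rough illustration of the idea described above (not the code from the pull request), the following Python sketch assigns each element a random weight, keeps the K largest weights per partition in a single pass, and then merges the per-partition candidates. The function names `partition_top_k` and `sample_fixed_size`, the representation of partitions as plain iterables, and the single shared random generator are all assumptions made for this example.

```python
import heapq
import random
from itertools import chain


def partition_top_k(elements, k, rng):
    """One pass over a partition: keep the k entries with the largest random weights."""
    heap = []  # min-heap of (weight, arrival_index, element); root is the smallest kept weight
    if k <= 0:
        return heap
    for index, element in enumerate(elements):
        # The arrival index breaks ties so elements themselves are never compared.
        entry = (rng.random(), index, element)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # New weight beats the smallest kept weight: swap it in.
            heapq.heapreplace(heap, entry)
    return heap


def sample_fixed_size(partitions, k, seed=None):
    """Sample k elements without replacement from the union of all partitions."""
    # Assumption: one generator is shared because the partitions are processed
    # sequentially here; truly parallel workers would each need their own
    # independently seeded generator.
    rng = random.Random(seed)
    partials = [partition_top_k(p, k, rng) for p in partitions]
    # The global top k by weight must be among the per-partition top k lists.
    best = heapq.nlargest(k, chain.from_iterable(partials), key=lambda e: e[0])
    return [element for _, _, element in best]


if __name__ == "__main__":
    parts = [range(0, 500), range(500, 1000)]
    print(sample_fixed_size(parts, 10, seed=42))
```

Each partition holds at most K candidates in memory, and only those candidates need to be sent to the merge step, which is why a single pass over the data is enough. Passing the same seed reproduces the same sample in this sequential sketch.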

> Create sample operator for Dataset
> ----------------------------------
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
> In order to be able to implement Stochastic Gradient Descent and a number of other machine
> learning algorithms we need to have a way to take a random sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset, choose the
> relative size of the sample, and set a seed for reproducibility.

This message was sent by Atlassian JIRA
