Date: Fri, 7 Aug 2015 03:39:45 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Commented] (FLINK-1901) Create sample operator for Dataset

    [ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661258#comment-14661258 ]

ASF GitHub Bot commented on FLINK-1901:
---------------------------------------

Github user ChengXiangLi commented on the pull request:

    https://github.com/apache/flink/pull/949#issuecomment-128580607

    Hi @tillrohrmann, the current implementation of fixed-size sampling draws a fixed-size random sample from each partition rather than from the whole dataset. Users may actually expect the latter most of the time.
    I'm researching how to randomly sample a fixed number of elements from a distributed data stream. I think we can pause the review of this PR until I merge the previous fix.


> Create sample operator for Dataset
> ----------------------------------
>
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility.
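
The distinction raised in the comment above (one fixed-size sample per partition versus one fixed-size sample over the whole dataset) can be illustrated with a small sketch. This is not the code from the pull request and does not use any Flink API. It is plain Python showing one common technique, assumed here for illustration only: tag every element with a random key, keep the k smallest keys in each partition, then keep the k smallest keys among the merged candidates. The names `sample_partition` and `sample_fixed_size` and the way the seed is derived per partition are hypothetical.

```python
import heapq
import random


def sample_partition(part_id, partition, k, rng):
    """Local step: tag each element with a random key, keep the k smallest keys."""
    # part_id and the element index act as tie-breakers, so the elements
    # themselves never have to be comparable.
    keyed = ((rng.random(), part_id, i, item) for i, item in enumerate(partition))
    return heapq.nsmallest(k, keyed)


def sample_fixed_size(partitions, k, seed=0):
    """Global step: merge the local candidates and keep the k smallest keys.

    Because every element receives an independent uniform key, the k smallest
    keys overall form a uniform sample without replacement from the whole
    dataset, and each of them is necessarily among the k smallest keys of its
    own partition.
    """
    candidates = []
    for part_id, partition in enumerate(partitions):
        # Assumption: one generator per partition, derived from a single seed
        # so that the result is reproducible.
        rng = random.Random(seed + part_id)
        candidates.extend(sample_partition(part_id, partition, k, rng))
    return [item for *_, item in heapq.nsmallest(k, candidates)]


if __name__ == "__main__":
    parts = [list(range(0, 10)), list(range(10, 13)), list(range(13, 40))]
    print(sample_fixed_size(parts, 5, seed=42))
```

Stopping after the local step would return up to k elements for every partition, which is the behavior described in the comment. The merge step is what reduces the result to k elements for the dataset as a whole, at the cost of collecting at most k candidates from each partition in one place.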