From: Jeyhun Karimov
Date: Fri, 05 Feb 2016 07:56:15 +0000
Subject: Re: Stream conversion
To: user@flink.apache.org

For example, I will run aggregate operations with other windows (n-window
aggregations) that have already been written out.

I tried your suggestion and used the filesystem sink, writing to HDFS. I got
k files in the HDFS directory, where k is the degree of parallelism (I used a
single machine). These files keep growing (new records are appended) as the
stream continues. Since the output files are not closed and their size changes
constantly, would this cause problems when processing the data with the
DataSet API, Hadoop, or another library?

On Thu, Feb 4, 2016 at 2:14 PM Robert Metzger <rmetzger@apache.org> wrote:

> I'm wondering which kind of transformations you want to apply to the
> window that you cannot apply with the DataStream API?
>
> Would it be sufficient for you to have the windows as files in HDFS and
> then run batch jobs against the windows on disk? If so, you could use our
> filesystem sink, which creates files bucketed by certain time windows.
>
> On Thu, Feb 4, 2016 at 11:33 AM, Stephan Ewen <sewen@apache.org> wrote:
>
>> Hi!
>>
>> If I understand you correctly, what you are looking for is a kind of
>> periodic batch job, where the input data for each batch is a large window.
>>
>> We have actually thought about this kind of application before. It is
>> not on the short-term roadmap that we shared a few weeks ago, but I think
>> it will come to Flink in the mid-term (that is, in some months or so); it
>> is asked for quite frequently.
>>
>> Implementing this as a core feature is a bit of effort. A mock that
>> writes out the windows and triggers a batch job sounds not too difficult,
>> actually.
>>
>> Greetings,
>> Stephan
>>
>> On Thu, Feb 4, 2016 at 10:30 AM, Sane Lee <leesane8@gmail.com> wrote:
>>
>>> I have a similar scenario. Any suggestion would be appreciated.
>>>
>>> On Thu, Feb 4, 2016 at 10:29 AM Jeyhun Karimov <je.karimov@gmail.com>
>>> wrote:
>>>
>>>> Hi Matthias,
>>>>
>>>> This does not necessarily need to happen in the API functions. I just
>>>> want a roadmap for adding this functionality. Should I save each
>>>> window's data to disk and create a new DataSet environment in parallel?
>>>> Or maybe change the trigger functionality?
>>>>
>>>> I have large windows. As I asked in a previous question, the problem in
>>>> Flink with large windows (that the data inside a window may not fit in
>>>> memory) will be solved. So, after getting a window's data, I want to do
>>>> more than what the functions in the streaming API offer. Therefore I
>>>> need to convert it to a DataSet. Any roadmap would be appreciated.
>>>>
>>>> On Thu, Feb 4, 2016 at 10:23 AM Matthias J. Sax <mjsax@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Sane,
>>>>>
>>>>> Currently, the DataSet and DataStream APIs are strictly separated.
>>>>> Thus, this is not possible at the moment.
>>>>>
>>>>> What kind of operation do you want to perform on the data of a window?
>>>>> Why do you want to convert the data into a data set?
>>>>>
>>>>> -Matthias
>>>>>
>>>>> On 02/04/2016 10:11 AM, Sane Lee wrote:
>>>>> > Dear all,
>>>>> >
>>>>> > I want to convert the data from each window of a stream to a
>>>>> > dataset. What is the best way to do that? So, while streaming, at
>>>>> > the end of each window I want to convert that data to a dataset and
>>>>> > possibly apply dataset transformations to it.
>>>>> > Any suggestions?
>>>>> >
>>>>> > -best
>>>>> > -sane
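
---

A minimal sketch of the bucketed filesystem sink Robert refers to, assuming
the RollingSink that ships in the flink-connector-filesystem module of the
Flink 0.10/1.0 era; the base path, bucket format, batch size, and source are
illustrative assumptions, not taken from this thread:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
    import org.apache.flink.streaming.connectors.fs.RollingSink;
    import org.apache.flink.streaming.connectors.fs.StringWriter;

    public class BucketedHdfsSinkJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in source; any DataStream<String> works here.
            DataStream<String> stream = env.socketTextStream("localhost", 9999);

            // Rolling filesystem sink: one subdirectory per time bucket,
            // part files rolled once they reach the batch size.
            RollingSink<String> sink = new RollingSink<>("hdfs:///flink/windows");
            sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm")); // one bucket per minute
            sink.setWriter(new StringWriter<String>());
            sink.setBatchSize(1024L * 1024L * 128L); // roll part files at ~128 MB

            stream.addSink(sink);
            env.execute("bucketed HDFS sink");
        }
    }

Each closed bucket directory then holds one "window" of data on HDFS that a
batch job can pick up, which is the periodic-batch pattern Stephan describes
in the quoted thread.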
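On the concern that the output files are never closed: the rolling sink marks
part files it is still writing as in-progress or pending (the markers are
configurable on the sink; by default, if memory serves, an underscore prefix
plus ".in-progress"/".pending" suffixes), and Hadoop-style input formats,
including Flink's own FileInputFormat, skip files whose names start with "_"
or ".". A batch job pointed at a finished bucket should therefore see only
closed files. A sketch, assuming the hypothetical bucket path below:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ReadFinishedBucket {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // One finished time bucket written by the rolling sink
            // (hypothetical path; substitute a real bucket directory).
            // In-progress/pending part files stay hidden from the input
            // format thanks to their "_"/"." name markers.
            DataSet<String> window =
                    env.readTextFile("hdfs:///flink/windows/2016-02-05--0750");

            // From here, any DataSet transformation applies; counting
            // records stands in for real n-window aggregation logic.
            long records = window.count();
            System.out.println("records in window: " + records);
        }
    }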
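Stephan's "mock" (write out each window, then trigger a batch job on it)
could start from a window function that tags every element with its window,
so a bucketed sink can route each window to its own directory. A sketch
against the Flink 1.0-era API; the types, window size, and tagging scheme
are assumptions:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.tuple.Tuple;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;

    public class WindowToDataSetMock {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in source of (key, value) pairs.
            DataStream<Tuple2<String, String>> stream = env
                    .socketTextStream("localhost", 9999)
                    .map(new MapFunction<String, Tuple2<String, String>>() {
                        @Override
                        public Tuple2<String, String> map(String line) {
                            return Tuple2.of("key", line);
                        }
                    });

            // Tag every element with the end timestamp of its window; a
            // bucketed sink can then write one directory per window, and an
            // external scheduler can launch a DataSet job once a window's
            // directory is complete.
            stream
                .keyBy(0)
                .timeWindow(Time.minutes(10))
                .apply(new WindowFunction<Tuple2<String, String>,
                        Tuple2<Long, String>, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple key, TimeWindow window,
                                      Iterable<Tuple2<String, String>> values,
                                      Collector<Tuple2<Long, String>> out) {
                        for (Tuple2<String, String> v : values) {
                            out.collect(Tuple2.of(window.getEnd(), v.f1));
                        }
                    }
                })
                .print(); // replace with a sink bucketed on the window tag

            env.execute("window write-out mock");
        }
    }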