From: Fabian Hueske
Date: Thu, 28 Apr 2016 14:47:37 +0200
Subject: Re: General Data questions - streams vs batch
To: user@flink.apache.org

True, flatMap does not have access to watermarks.
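At the operator level, however, watermarks are visible. As a rough sketch of the approach suggested in this thread (extending the built-in StreamFlatMap operator and overriding processWatermark()): note that this targets Flink's internal operator API around the 1.0 release, so class names may differ across versions, and BatchingFlatMapFunction with its flush() hook is a hypothetical name for a user function that buffers elements.

```
// Sketch only - Flink's operator API is internal and version-dependent.
// Idea: reuse StreamFlatMap's element handling, but flush any partially
// filled batch when a watermark arrives (including the final watermark
// that marks the end of a finite stream).
public class FlushingFlatMap<IN, OUT> extends StreamFlatMap<IN, OUT> {

    public FlushingFlatMap(BatchingFlatMapFunction<IN, OUT> function) {
        super(function);  // StreamFlatMap drives the regular flatMap() calls
    }

    @Override
    public void processWatermark(Watermark mark) throws Exception {
        // flush() is a hypothetical hook on the user function that emits
        // whatever is left in its batch buffer.
        ((BatchingFlatMapFunction<IN, OUT>) userFunction).flush();
        super.processWatermark(mark);  // forward the watermark downstream
    }
}
```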
You can also go a bit more to the low level and directly implement an AbstractStreamOperator that implements the OneInputStreamOperator interface.
This is essentially the base class for the built-in stream operators, and it has access to watermarks (OneInputStreamOperator.processWatermark()).

Maybe the easiest is to simply extend StreamFlatMap and override the processWatermark() method.

Cheers, Fabian

2016-04-28 14:40 GMT+02:00 Konstantin Kulagin <kkulagin@gmail.com>:

> Thanks Fabian,
>
> works like a charm except the case when the stream is finite (or I have a
> dataset from the beginning).
>
> In this case I need to somehow identify that the stream is finished and
> emit the latest batch (which might have fewer elements) to the output.
> What is the best way to do that? In streams and windows we have support
> for watermarks, but I do not see similar stuff for a flatMap operation.
>
> In the sample below I need to emit the values from 30 to 32 as well:
>
> ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
> DataSet<Tuple2<Long, String>> source = env.fromCollection(
>     LongStream.range(0, 33).mapToObj(l ->
>         Tuple2.of(l, "This is " + l)).collect(Collectors.toList()));
>
> source.flatMap(new RichFlatMapFunction<Tuple2<Long, String>, Tuple2<Long, String>>() {
>     List<Tuple2<Long, String>> cache = new ArrayList<>();
>
>     @Override
>     public void flatMap(Tuple2<Long, String> value,
>                         Collector<Tuple2<Long, String>> out) throws Exception {
>         cache.add(value);
>         if (cache.size() == 5) {
>             System.out.println("!!!!! " + Thread.currentThread().getId()
>                 + ": " + Joiner.on(",").join(cache));
>             cache.stream().forEach(out::collect);
>             cache.clear();
>         }
>     }
> }).setParallelism(2).print();
>
> env.execute("yoyoyo");
>
> Output (Flink-related stuff excluded):
>
> !!!!! 35: (1,This is 1),(3,This is 3),(5,This is 5),(7,This is 7),(9,This is 9)
> !!!!! 36: (0,This is 0),(2,This is 2),(4,This is 4),(6,This is 6),(8,This is 8)
> !!!!! 35: (11,This is 11),(13,This is 13),(15,This is 15),(17,This is 17),(19,This is 19)
> !!!!! 36: (10,This is 10),(12,This is 12),(14,This is 14),(16,This is 16),(18,This is 18)
> !!!!! 35: (21,This is 21),(23,This is 23),(25,This is 25),(27,This is 27),(29,This is 29)
> !!!!! 36: (20,This is 20),(22,This is 22),(24,This is 24),(26,This is 26),(28,This is 28)
>
> And if you can give a bit more info on why I will have latency issues in
> the case of a varying arrival rate of elements, that would be perfect. Or
> point me to a place where I can read about it.
>
> Thanks!
> Konstantin.
>
> On Thu, Apr 28, 2016 at 7:26 AM, Fabian Hueske <fhueske@gmail.com> wrote:
>
>> Hi Konstantin,
>>
>> if you do not need a deterministic grouping of elements, you should not
>> use a keyed stream or window.
>> Instead you can do the lookups in a parallel flatMap function. The
>> function would collect arriving elements and perform a lookup query after
>> a certain number of elements has arrived (this can cause high latency if
>> the arrival rate of elements is low or varies).
>> The flatMap function can be executed in parallel and does not require a
>> keyed stream.
>>
>> Best, Fabian
>>
>> 2016-04-25 18:58 GMT+02:00 Konstantin Kulagin <kkulagin@gmail.com>:
>>
>>> As usual - thanks for the answers, Aljoscha!
>>>
>>> I think I understood what I wanted to know.
>>>
>>> 1) To add a few comments: about streams, I was thinking about something
>>> similar to Storm, where you can have one Source and 'duplicate' the same
>>> entry going through different 'paths'.
>>> Something like this:
>>> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_storm-user-guide/content/figures/1/figures/SpoutsAndBolts.png
>>> And later you can 'join' these separate streams back.
>>> And actually I think this is what I meant:
>>> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/JoinedStreams.html
>>> - this one actually 'joins' by window.
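For illustration, a windowed join with JoinedStreams looks roughly like the following sketch. The stream and field names are made up, and window-assigner class names vary across Flink releases:

```
// Sketch: join two streams on a shared key over tumbling 5-second windows.
// 'left' and 'right' are hypothetical DataStream<Tuple2<Long, String>> inputs.
DataStream<String> joined = left
    .join(right)
    .where(t -> t.f0)     // key selector for the left stream
    .equalTo(t -> t.f0)   // key selector for the right stream
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .apply((l, r) -> l.f1 + " | " + r.f1);
```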
>>>
>>> As for the 'exactly-once guarantee', I got the difference from this paper:
>>> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink
>>> - Thanks!
>>>
>>> 2) understood, thank you very much
>>>
>>> I'll probably bother you one more time with another question:
>>>
>>> 3) Let's say I have a Source which provides raw (i.e. non-keyed) data.
>>> And let's say I need to 'enhance' each entry with some fields which I can
>>> take from a database.
>>> So I define some DbEnhanceOperation.
>>>
>>> A database query might be expensive, so I would want to
>>> a) batch entries to perform queries
>>> b) be able to have several parallel DbEnhanceOperations so those will
>>> not slow down my whole processing.
>>>
>>> I do not see a way to do that.
>>>
>>> Problems:
>>>
>>> I cannot go with countWindowAll because of b) - that operation does not
>>> support parallel execution (correct?)
>>>
>>> So I need to create a windowed stream, and for that I need to have some
>>> key - correct? I.e. I cannot create windows on a stream of general
>>> objects just using the number of objects.
>>>
>>> I probably can 'emulate' a keyed stream by providing some 'fake' key.
>>> But in this case I can parallelize only on different keys. Again - it is
>>> probably doable by introducing some AtomicLong key generator in the first
>>> place (this part is probably hard to understand - I can share details if
>>> necessary) but it still looks like a bit of a hack :)
>>>
>>> But the general question - can I implement 3) 'normally', in a
>>> Flink way?
>>>
>>> Thanks!
>>> Konstantin.
>>>
>>> On Mon, Apr 25, 2016 at 10:53 AM, Aljoscha Krettek <aljoscha@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>> I'll try and answer your questions separately.
>>>> First, a general remark: although Flink has the DataSet API for batch
>>>> processing and the DataStream API for stream processing, we only have
>>>> one underlying streaming execution engine that is used for both. Now,
>>>> regarding the questions:
>>>>
>>>> 1) What do you mean by "parallel into 2 streams"? Maybe that could
>>>> influence my answer, but I'll just give a general answer: Flink does not
>>>> give any guarantees about the ordering of elements in a Stream or in a
>>>> DataSet. This means that merging or unioning two streams/data sets will
>>>> just mean that operations see all elements of the two merged streams,
>>>> but the order in which we see them is arbitrary. This means that we
>>>> don't keep buffers based on time or size or anything.
>>>>
>>>> 2) The elements that flow through the topology are not tracked
>>>> individually; each operation just receives elements, updates state, and
>>>> sends elements to the downstream operation. In essence this means that
>>>> elements themselves don't block any resources except if they alter some
>>>> kept state in operations. If you have a stateless pipeline that only has
>>>> filters/maps/flatMaps, then the amount of required resources is very low.
>>>>
>>>> For a finite data set, elements are also streamed through the topology.
>>>> Only if you use operations that require grouping or sorting (such as
>>>> groupBy/reduce and join) will elements be buffered in memory or on disk
>>>> before they are processed.
>>>>
>>>> To answer your last question: if you only do stateless
>>>> transformations/filters, then you are fine to use either API, and the
>>>> performance should be similar.
>>>>
>>>> Cheers,
>>>> Aljoscha
>>>>
>>>> On Sun, 24 Apr 2016 at 15:54 Konstantin Kulagin <kkulagin@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I have some kind of a general question, in order to get more
>>>>> understanding of stream vs. finite data transformation. More
>>>>> specifically, I am trying to understand the entities' lifecycle during
>>>>> processing.
>>>>>
>>>>> 1) For example in the case of streams: suppose we start with some
>>>>> key-value source and parallel it into 2 streams by key. Each stream
>>>>> modifies the entries' values, let's say adds some fields. And we want
>>>>> to merge them back later. How does that happen?
>>>>> Will the merging point keep some finite buffer of entries? Based on
>>>>> time or size?
>>>>>
>>>>> I understand that the right solution in this case would probably be
>>>>> having one stream and achieving more performance by increasing
>>>>> parallelism, but what if I have 2 sources from the beginning?
>>>>>
>>>>> 2) Also I assume that in the case of streaming, each entry is
>>>>> considered 'processed' once it passes the whole chain and is emitted
>>>>> into some sink, so after that it does not consume resources. Basically
>>>>> similar to what Storm is doing.
>>>>> But in the case of finite data (data sets): how big an amount of data
>>>>> will the system keep in memory? The whole set?
>>>>>
>>>>> I probably have some example of a dataset vs. stream 'mix': I need to
>>>>> *transform* a big but finite chunk of data. I don't really need to do
>>>>> any 'joins', grouping or something like that, so I never need to store
>>>>> the whole dataset in memory/storage. What would my choice be in this
>>>>> case?
>>>>>
>>>>> Thanks!
>>>>> Konstantin
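The pattern at the heart of this thread (collect fixed-size batches, run one lookup per full batch, and flush the smaller trailing batch when a finite input ends) can be sketched in plain Java, independent of any Flink API; Batcher and its callback are made-up names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Collects elements into fixed-size batches and hands each full batch to
// a consumer (e.g. a database lookup). flush() emits the smaller trailing
// batch - the role that processWatermark()/close() plays inside Flink.
class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> onBatch;
    private final List<T> buffer = new ArrayList<>();

    Batcher(int batchSize, Consumer<List<T>> onBatch) {
        this.batchSize = batchSize;
        this.onBatch = onBatch;
    }

    void add(T element) {
        buffer.add(element);
        if (buffer.size() == batchSize) {
            flush();
        }
    }

    // Call once when the finite input is exhausted.
    void flush() {
        if (!buffer.isEmpty()) {
            onBatch.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

With 33 elements and a batch size of 5, this produces six full batches plus one trailing batch of three, covering the values 30 to 32 that the flatMap in the thread never emitted.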