flink-user mailing list archives

From Konstantin Kulagin <kkula...@gmail.com>
Subject Re: General Data questions - streams vs batch
Date Mon, 25 Apr 2016 16:58:12 GMT
As usual, thanks for the answers, Aljoscha!

I think I now understand what I wanted to know.

1) To add a few comments: regarding streams, I was thinking of something
similar to Storm, where you can have one Source and 'duplicate' the same
entry so it goes through different 'paths'.
Something like this:
And later you can 'join' these separate streams back.
And actually I think this is what I meant:
- this one actually 'joins' by window.
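Conceptually, the duplicate-then-merge topology I have in mind can be sketched in plain Java (this is not the Flink API; the class, the transformations and the data are all made up purely for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SplitMergeSketch {

    // Duplicate each source entry into two independent 'paths',
    // transform each path differently, then merge the results back.
    public static List<String> run() {
        List<String> source = List.of("a", "b", "c");

        // Path 1: one transformation applied to every source entry.
        Stream<String> path1 =
                source.stream().map(s -> s + "-upper:" + s.toUpperCase());

        // Path 2: a different transformation applied to the *same* entries.
        Stream<String> path2 =
                source.stream().map(s -> s + "-len:" + s.length());

        // "Join" the two paths back into a single stream of results.
        return Stream.concat(path1, path2).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```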

As for the 'exactly-once' guarantee, I got the difference from this paper:
- Thanks!

2) Understood, thank you very much.

I'll probably bother you one more time with another question:

3) Let's say I have a Source which provides raw (i.e. non-keyed) data, and
I need to 'enhance' each entry with some fields that I can take from a
database.
So I define some DbEnhanceOperation.

A database query might be expensive, so I would want to:
a) batch entries to perform queries, and
b) be able to run several DbEnhanceOperations in parallel so they do not
slow down my whole processing.

But I do not see a way to do that.


I cannot go with countWindowAll because of b): that operation is
non-parallel, so it does not run as several parallel instances (correct?).

So I need to create a windowed stream, and for that I need to have some key
- correct? I.e. I cannot create windows on a stream of general objects just
by counting them.

I probably can 'emulate' a keyed stream by providing some 'fake' key, but
in that case I can only parallelize across different keys. Again, it is
probably doable by introducing some AtomicLong key generator up front
(this part is probably hard to understand - I can share details if
necessary), but it still looks like a bit of a hack :)
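To make the 'fake key' idea concrete, here is a minimal plain-Java sketch (again not Flink code; the class name, the `parallelism`/`batchSize` parameters and the flush behaviour are all hypothetical) of what I mean: an AtomicLong hands out round-robin synthetic keys, each key stands in for one parallel DbEnhanceOperation, and each buffer is flushed as a batch once it reaches the count limit:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class BatchingSketch {
    private final int parallelism;
    private final int batchSize;
    private final AtomicLong counter = new AtomicLong();
    // One buffer per synthetic key; each key stands in for one
    // parallel DbEnhanceOperation instance.
    private final Map<Long, List<String>> buffers = new HashMap<>();
    private final List<List<String>> flushedBatches = new ArrayList<>();

    public BatchingSketch(int parallelism, int batchSize) {
        this.parallelism = parallelism;
        this.batchSize = batchSize;
    }

    // Assign a round-robin key, buffer the entry under that key,
    // and flush the buffer once it holds a full batch.
    public void process(String entry) {
        long key = counter.getAndIncrement() % parallelism;
        List<String> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(entry);
        if (buffer.size() >= batchSize) {
            // Here the real operator would run ONE database query
            // for the whole batch instead of one query per entry.
            flushedBatches.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public List<List<String>> batches() {
        return flushedBatches;
    }

    public static void main(String[] args) {
        BatchingSketch sketch = new BatchingSketch(2, 2);
        for (int i = 0; i < 8; i++) {
            sketch.process("entry-" + i);
        }
        System.out.println(sketch.batches());
    }
}
```

With parallelism 2 and batch size 2, the eight entries are split across two key buffers and flushed as four batches of two, which is the effect I would hope to get from a keyed count window.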

But the general question is: can I implement 3) 'normally', in a Flink way?


On Mon, Apr 25, 2016 at 10:53 AM, Aljoscha Krettek <aljoscha@apache.org>
wrote:

> Hi,
> I'll try and answer your questions separately. First, a general remark:
> although Flink has the DataSet API for batch processing and the DataStream
> API for stream processing, we only have one underlying streaming execution
> engine that is used for both. Now, regarding the questions:
> 1) What do you mean by "parallel into 2 streams"? Maybe that could
> influence my answer, but I'll just give a general one: Flink does not
> give any guarantees about the ordering of elements in a Stream or in a
> DataSet. Merging or unioning two streams/data sets just means that
> operations see all elements of the two merged streams, but the order in
> which they see them is arbitrary. In particular, we don't keep
> buffers based on time or size or anything.
> 2) The elements that flow through the topology are not tracked
> individually; each operation just receives elements, updates state and
> sends elements to the downstream operation. In essence this means that
> elements themselves don't block any resources unless they alter some kept
> state in operations. If you have a stateless pipeline that only has
> filters/maps/flatMaps then the amount of required resources is very low.
> For a finite data set, elements are also streamed through the topology.
> Only if you use operations that require grouping or sorting (such as
> groupBy/reduce and join) will elements be buffered in memory or on disk
> before they are processed.
> To answer your last question: if you only do stateless
> transformations/filters then you are fine to use either API, and the
> performance should be similar.
> Cheers,
> Aljoscha
> On Sun, 24 Apr 2016 at 15:54 Konstantin Kulagin <kkulagin@gmail.com>
> wrote:
>> Hi guys,
>> I have some kind of general question, in order to get a better
>> understanding of stream vs. finite data transformation. More specifically,
>> I am trying to understand the lifecycle of entries during processing.
>> 1) For example, in the case of streams: suppose we start with some
>> key-value source and parallelize it into 2 streams by key. Each stream
>> modifies the entries' values, let's say adds some fields, and we want to
>> merge them back later. How does that happen?
>> Will the merging point keep some finite buffer of entries? Based on time
>> or size?
>> I understand that the right solution in this case would probably be to
>> have one stream and achieve more performance by increasing parallelism,
>> but what if I have 2 sources from the beginning?
>> 2) Also, I assume that in the case of streaming each entry is considered
>> 'processed' once it passes the whole chain and is emitted into some sink,
>> so after that it will not consume resources. Basically similar to what
>> Storm is doing.
>> But in the case of finite data (data sets): how big an amount of data
>> will the system keep in memory? The whole set?
>> I probably have an example of a dataset vs. stream 'mix': I need to
>> *transform* a big but finite chunk of data. I don't really need to do any
>> joins, grouping or anything like that, so I never need to store the whole
>> dataset in memory/storage. What would my choice be in this case?
>> Thanks!
>> Konstantin
