flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Piotr Nowojski <pi...@data-artisans.com>
Subject Re: Correlation between data streams/operators and threads
Date Mon, 13 Nov 2017 13:51:53 GMT
Sure, let us know if you have other questions or encounter some issues.

Thanks, Piotrek

> On 13 Nov 2017, at 14:49, Shailesh Jain <shailesh.jain@stellapps.com> wrote:
> 
> Thanks, Piotr. I'll try it out and will get back in case of any further questions.
> 
> Shailesh
> 
> On Fri, Nov 10, 2017 at 5:52 PM, Piotr Nowojski <piotr@data-artisans.com <mailto:piotr@data-artisans.com>>
wrote:
> 1.  It’s a little bit more complicated then that. Each operator chain/task will be
executed in separate thread (parallelism
>  Multiplies that). You can check in web ui how was your job split into tasks.
> 
> 3. Yes that’s true, this is an issue. To preserve the individual watermarks/latencies
(assuming that you have some way to calculate them individually per each device), you could
either:
> 
> a) have separate jobs per each device with parallelism 1. Pros: independent failures/checkpoints,
Cons: resource usage (number of threads increases with number of devices, there are also other
resources consumed by each job), efficiency, 
> b) have one job with multiple data streams. Cons: resource usage (threads)
> c) ignore Flink’s watermarks, and implement your own code in place of it. You could
read all of your data in single data stream, keyBy partition/device and manually handle watermarks
logic. You could either try to wrap CEP/Window operators or copy/paste and modify them to
suite your needs. 
> 
> I would start and try out from a). If it work for your cluster/scale then that’s fine.
If not try b) (would share most of the code with a), and as a last resort try c).
> 
> Kostas, would you like to add something?
> 
> Piotrek
> 
>> On 9 Nov 2017, at 19:16, Shailesh Jain <shailesh.jain@stellapps.com <mailto:shailesh.jain@stellapps.com>>
wrote:
>> 
>> On 1. - is it tied specifically to the number of source operators or to the number
of Datastream objects created. I mean does the answer change if I read all the data from a
single Kafka topic, get a Datastream of all events, and the apply N filters to create N individual
streams?
>> 
>> On 3. - the problem with partitions is that watermarks cannot be different per partition,
and since in this use case, each stream is from a device, the latency could be different (but
order will be correct almost always) and there are high chances of loosing out on events on
operators like Patterns which work with windows. Any ideas for workarounds here?
>> 
>> 
>> Thanks,
>> Shailesh
>> 
>> On 09-Nov-2017 8:48 PM, "Piotr Nowojski" <piotr@data-artisans.com <mailto:piotr@data-artisans.com>>
wrote:
>> Hi,
>> 
>> 1. 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/parallel.html <https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/parallel.html>
>> 
>> Number of threads executing would be roughly speaking equal to of the number of input
data streams multiplied by the parallelism.
>> 
>> 2. 
>> Yes, you could dynamically create more data streams at the job startup.
>> 
>> 3.
>> Running 10000 independent data streams on a small cluster (couple of nodes) will
definitely be an issue, since even with parallelism set to 1, there would be quite a lot of
unnecessary threads. 
>> 
>> It would be much better to treat your data as a single data input stream with multiple
partitions. You could assign partitions between source instances based on parallelism. For
example with parallelism 6:
>> - source 0 could get partitions 0, 6, 12, 18
>> - source 1, could get partitions 1, 7, …
>> …
>> - source 5, could get partitions 5, 11, ...
>> 
>> Piotrek
>> 
>>> On 9 Nov 2017, at 10:18, Shailesh Jain <shailesh.jain@stellapps.com <mailto:shailesh.jain@stellapps.com>>
wrote:
>>> 
>>> Hi,
>>> 
>>> I'm trying to understand the runtime aspect of Flink when dealing with multiple
data streams and multiple operators per data stream.
>>> 
>>> Use case: N data streams in a single flink job (each data stream representing
1 device - with different time latencies), and each of these data streams gets split into
two streams, of which one goes into a bunch of CEP operators, and one into a process function.
>>> 
>>> Questions:
>>> 1. At runtime, will the engine create one thread per data stream? Or one thread
per operator?
>>> 2. Is it possible to dynamically create a data stream at runtime when the job
starts? (i.e. if N is read from a file when the job starts and corresponding N streams need
to be created)
>>> 3. Are there any specific performance impacts when a large number of streams
(N ~ 10000) are created, as opposed to N partitions within a single stream?
>>> 
>>> Are there any internal (design) documents which can help understanding the implementation
details? Any references to the source will also be really helpful.
>>> 
>>> Thanks in advance.
>>> 
>>> Shailesh
>>> 
>>> 
>> 
>> 
> 
> 


Mime
View raw message