flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: dynamic streams and patterns
Date Thu, 14 Jul 2016 14:26:16 GMT
Hi Claudia,

1) What do you mean by dynamically adding? In standalone mode (which you
would probably use with Docker images), you can just start additional
TaskManagers, which will connect to a JobManager.
So you could implement some monitoring to start new TaskManagers as soon as
they are needed.
In general, we recommend starting one JobManager per job, but running
multiple jobs per JM is also possible. I don't have much experience with
many concurrent jobs on a JM, but in theory there are no limits.
In practice you'll probably run into stability issues at some point,
because the JM needs to coordinate too many jobs / taskmanagers.
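
The monitoring idea in 1) boils down to comparing the slots your jobs need with the slots the running TaskManagers offer. Below is a minimal sketch of that bookkeeping, under two assumptions: each job occupies as many slots as its parallelism (Flink's default slot sharing), and every TaskManager offers the same number of slots (`taskmanager.numberOfTaskSlots`). The function names are made up for illustration and are not part of Flink; actually starting the containers is left to whatever orchestrates your Docker images.

```python
import math


def taskmanagers_needed(job_parallelisms, slots_per_taskmanager):
    """Number of TaskManagers required to host all jobs.

    Assumes one slot per unit of job parallelism and identical TaskManagers.
    """
    if slots_per_taskmanager <= 0:
        raise ValueError("slots_per_taskmanager must be positive")
    total_slots = sum(job_parallelisms)
    return math.ceil(total_slots / slots_per_taskmanager)


def taskmanagers_to_start(job_parallelisms, slots_per_taskmanager, running_taskmanagers):
    """How many additional TaskManagers the monitoring should launch (never negative)."""
    needed = taskmanagers_needed(job_parallelisms, slots_per_taskmanager)
    return max(0, needed - running_taskmanagers)
```

For example, three jobs with parallelism 4, 4 and 2 on TaskManagers with 4 slots each need 10 slots in total, so the sketch asks for three TaskManagers.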

2) Yes, that would be an option. The most important aspects here are: data
throughput per admin group / state size / analysis complexity.
If each administrative group is low traffic (~100,000 elements /
second), you could probably process the data without using a Flink cluster
at all.
The Standalone mode of Flink starts a JobManager and TaskManager within the
same JVM. You could prepare a Docker image with a standalone Flink plus the
job and start one per administrative group. I think a reasonably sized
machine (8 cores, 32 GB of main memory) should handle that.

3) Correct, you cannot modify a running job. You can follow the King.com /
RBEA approach instead.
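
The general idea behind such an approach, as far as the linked post describes it, is to keep the job graph fixed and treat the rules themselves as data: one operator holds the currently active rules, control messages add or remove them, and every sensor event is checked against whatever rules are active at that moment. The following is a minimal sketch of that idea in plain Python, not the Flink or CEP API. The class, method names and the simple threshold predicate are hypothetical, and a real pattern would be stateful rather than a per-event check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional, Tuple

Rule = Tuple[FrozenSet[str], Callable[[float], bool]]


@dataclass
class DynamicRuleOperator:
    """Hypothetical operator: the rules are state, so the job graph never changes."""

    # rule_id -> (sensor ids the rule applies to, predicate over one reading)
    rules: Dict[str, Rule] = field(default_factory=dict)

    def on_control(self, action: str, rule_id: str,
                   sensors: Iterable[str] = (),
                   predicate: Optional[Callable[[float], bool]] = None) -> None:
        """Handle a control message that adds or removes a rule at runtime."""
        if action == "add":
            if predicate is None:
                raise ValueError("an added rule needs a predicate")
            self.rules[rule_id] = (frozenset(sensors), predicate)
        elif action == "remove":
            self.rules.pop(rule_id, None)
        else:
            raise ValueError(f"unknown action: {action}")

    def on_event(self, sensor_id: str, value: float) -> List[str]:
        """Return the ids of all active rules that match this sensor reading."""
        return sorted(
            rule_id
            for rule_id, (sensors, predicate) in self.rules.items()
            if sensor_id in sensors and predicate(value)
        )
```

In a Flink job, the control messages and the sensor events would arrive as two connected streams feeding the same operator, so changing which sensors a pattern covers becomes a control message instead of a redeployment.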

4) That depends on the definition of great.

I think the above answers depend heavily on the expected amount of data and
the available hardware. Since Flink is quite easy to deploy, and a simple
test job can be implemented in a few hours, I would suggest running some
experiments to see how Flink behaves in the given environment.

Regards,
Robert




On Mon, Jul 11, 2016 at 9:39 AM, Claudia Wegmann <c.wegmann@kasasi.de>
wrote:

> Hey everyone,
>
>
>
> I’m quite new to Apache Flink. I’m trying to build a system with Flink and
> wanted to hear your opinion and whether the proposed architecture is even
> possible with Flink. The environment for the system will be a microservice
> architecture handling messaging via async events.
>
>
>
> I want to give you a brief description of the system:
>
> -          there are a lot of sensors, each of which produces a stream of
> data
>
> -          on each stream of sensor data I want to match one or more
> patterns via Flink’s CEP library
>
> -          each of these sensors belongs to one or more administrative
> entities
>
> -          each pattern belongs to one administrative entity and needs to
> be evaluated on one or more sensors of this entity
>
> -          the user can change the connection of a sensor to an
> administrative entity as well as the sensors on which a pattern needs to be
> evaluated
>
>
>
> I hope this description is enough to give you an overview of the system.
>
>
>
> This is what I am thinking of doing:
>
> -          I will have an Apache Kafka cluster and a Flink cluster
> running inside docker containers
>
> -          I create a topic in Kafka for each administrative entity
>
> -          for each entity I create a Flink job which consumes the
> corresponding topic
>
> -          the Flink job creates a stream of the sensor data
>
> -          it splits the stream to a stream for each sensor
>
> -          for each pattern that has to be evaluated on one stream I
> create a pattern stream
>
>
>
> This results in the following:
>
> -          there will be a lot of Kafka topics
>
> -          for each topic there will be one Flink job (-> a lot of jobs,
> too)
>
> -          in each job there will be quite a lot of streams and patterns
> and therefore even more pattern streams
>
>
>
> The main questions that arose while thinking of this implementation:
>
> 1.)    From other questions here, I know that there is currently no way
> to dynamically add TaskManagers to the Flink cluster. The proposed way to
> handle that is to start many more TaskManagers than initially needed. Is it
> even possible to have a great number of jobs on one cluster?
>
> 2.)    Would a viable alternative be to just dynamically start up a new
> cluster for each administrative entity?
>
> 3.)    I also learned that Flink isn’t able to handle dynamically
> created streams and patterns. I guess that is due to the fixed calculation
> of the execution graph at the job’s start. Is there a way to make Flink
> recalculate the graph of a running job? I also just found out about this
> example [1], where they use scripts to hot deploy queries. I will look into
> that, too. Maybe it provides an acceptable solution for me.
>
> 4.)    Is it even possible to have a great number of streams and patterns
> in one Flink job?
>
>
>
> Any comments and feedback are greatly appreciated.
>
> Thanks a lot in advance :)
>
> Best, Claudia
>
>
>
> [1]: https://techblog.king.com/rbea-scalable-real-time-analytics-king/
>
>
>
