flink-user mailing list archives

From Junguk Cho <jman...@gmail.com>
Subject Question about Flink internals
Date Wed, 06 Sep 2017 19:54:59 GMT
Hi, All.

I am new to Flink. I just installed Flink on a cluster and have started
reading the documentation to understand Flink's internals. After reading
some of it, I have a few questions. I have prior experience with Storm and
Heron, so I am relating their mechanisms to my questions to understand
Flink better.

1. Can I specify worker parallelism explicitly, as in Storm?

2. Records in Flink
Can I think of a "record" in Flink as almost the same as a Tuple in Storm?
A Tuple in Storm carries the "real data" plus metadata (e.g., stream type,
source id, and so on).
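To make question 2 concrete, this is roughly how I picture a Storm Tuple, as payload plus routing metadata (the field names below are my own illustration, not Storm's actual API):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class TupleLike:
    """Illustration only: payload plus routing metadata, as in a Storm Tuple."""
    values: List[Any]       # the "real data"
    stream_id: str          # which logical stream this tuple belongs to
    source_component: str   # id of the component (spout/bolt) that emitted it
    source_task: int        # task index of the emitter

t = TupleLike(values=["word", 1], stream_id="default",
              source_component="splitter", source_task=3)
print(t.values, t.stream_id)
```

So my question is whether a Flink record bundles similar metadata or is just the data itself.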

3. How does partitioning (e.g., shuffle, map) work internally?
In Storm, each worker has a table mapping (worker id) to (TCP info of the
next workers). Based on this table, after the partition function executes,
a Tuple is forwarded to the next hop.
Is it the same in Flink?
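To make question 3 concrete, this is how I imagine key-based partitioning picking a downstream channel (a simplified sketch of the general idea; the helper below is my own illustration, not Flink's actual code, which I understand uses murmur hashing and key groups):

```python
def select_channel(key, num_channels):
    """Pick the downstream subtask for a key by hashing it (simplified sketch)."""
    return hash(key) % num_channels

# Every record with the same key lands on the same channel, so the
# receiving subtask can keep per-key state locally.
channels = {}
for word in ["apple", "banana", "apple", "cherry"]:
    channels.setdefault(select_channel(word, 4), []).append(word)
print(channels)
```

My question is whether the selected channel is then resolved to a TCP connection via a routing table, as in Storm.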

4. How does Flink detect faults such as a dead worker or a machine failure?
Based on the documents, the Job Manager checks the liveness of Task
Managers with heartbeat messages.
In Storm, the supervisor (which I think is similar to a Task Manager) first
detects a dead worker based on heartbeats and locally restarts it. For a
machine failure, Nimbus (which I think is similar to the Job Manager)
detects the failure based on the supervisor's heartbeat and reschedules all
assigned workers onto other machines.
How does Flink handle these cases?
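To make question 4 concrete, this is the kind of heartbeat bookkeeping I have in mind for the Job Manager (a toy sketch of the general mechanism, not Flink's implementation):

```python
import time

class HeartbeatMonitor:
    """Toy sketch: declare a task manager dead if no heartbeat within a timeout."""
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, tm_id, now=None):
        self.last_seen[tm_id] = now if now is not None else time.time()

    def dead_task_managers(self, now=None):
        now = now if now is not None else time.time()
        return [tm for tm, t in self.last_seen.items()
                if now - t > self.timeout_s]

m = HeartbeatMonitor(timeout_s=10)
m.heartbeat("tm-1", now=0)
m.heartbeat("tm-2", now=8)
print(m.dead_task_managers(now=12))  # tm-1 missed the 10s deadline
```

What I want to know is what happens after detection: does the Job Manager restart only the affected tasks locally (like a Storm supervisor would) or reschedule the whole job?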

5. For exactly-once delivery, Flink uses checkpointing and record replay.
This requires a replayable message queue (e.g., Kafka) for record replay.
Kafka uses TCP to send and receive data, so I wonder: if the data source
does not use TCP (e.g., IoT sensors), what is the general solution for
record replay?
For example, source workers might be directly connected to several inputs
(e.g., IoT sensors), although I think that is not a typical deployment.
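What I mean by a replayable source is something like this offset-based log (a toy sketch of the Kafka-style idea, not any real API): the source records an offset at each checkpoint, and on failure it rewinds to the last checkpointed offset and re-emits.

```python
class ReplayableSource:
    """Toy sketch: a durable log with offsets so records can be replayed."""
    def __init__(self, records):
        self.records = list(records)  # durable log, as Kafka would provide
        self.offset = 0

    def poll(self):
        rec = self.records[self.offset]
        self.offset += 1
        return rec

    def snapshot(self):
        return self.offset            # saved as part of a checkpoint

    def restore(self, offset):
        self.offset = offset          # rewind after a failure

src = ReplayableSource(["a", "b", "c", "d"])
src.poll(); src.poll()
ckpt = src.snapshot()     # checkpoint taken at offset 2
src.poll()                # "c" processed, then the worker crashes
src.restore(ckpt)         # recovery rewinds to the checkpoint
print(src.poll())         # "c" is re-emitted
```

With raw sensors pushing data directly, there is no such log to rewind, which is why I am asking what the general solution is.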

6. Flink supports cycles.
However, based on the documents, the tasks of a cycle act as a regular
dataflow source and sink respectively, yet they are collocated on the same
physical instance so they can share an in-memory buffer and thus implement
the loopback stream transparently.
So, what happens if the number of workers forming a cycle is high? It would
be hard to place them all on the same physical machine.

Thanks,
Junguk
