flink-user mailing list archives

From Konstantin Knauf <konstan...@ververica.com>
Subject Re: How to build dependencies and connections between stream jobs?
Date Mon, 03 Jun 2019 08:01:20 GMT
Hi Henry,

Apache Kafka or other message queues like Apache Pulsar or AWS Kinesis are
in general the most common way to connect multiple streaming jobs. In my
experience, though, the dependencies between streaming jobs are of a
different nature. For batch jobs, it makes sense to schedule one after the
other or to define more complicated relationships. Streaming jobs all
process data continuously, so the "coordination" happens on a different
level.

To avoid duplication, you can use the Kafka exactly-once sink, but this
comes with a latency penalty (transactions are only committed on checkpoint
completion).
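To illustrate the latency trade-off, here is a toy model (plain Python, not
Flink API; all names are illustrative) of a transactional sink: records
written between checkpoints stay in an open transaction and only become
visible downstream once the checkpoint completes, so a failure before the
checkpoint simply discards the uncommitted records instead of duplicating
them on replay.

```python
# Toy model of an exactly-once (transactional) sink. Illustrative only;
# the real Flink Kafka sink uses Kafka transactions and two-phase commit.

class TransactionalSink:
    def __init__(self):
        self.committed = []   # what downstream consumers can read
        self.pending = []     # open transaction, invisible downstream

    def write(self, record):
        self.pending.append(record)

    def on_checkpoint_complete(self):
        # Commit: pending records become visible all at once. This is why
        # end-to-end latency is bounded below by the checkpoint interval.
        self.committed.extend(self.pending)
        self.pending = []

    def on_failure(self):
        # Abort: uncommitted records are dropped; upstream replays them.
        self.pending = []

sink = TransactionalSink()
sink.write("a")
sink.write("b")
sink.on_checkpoint_complete()   # "a" and "b" now readable downstream
sink.write("c")
sink.on_failure()               # "c" is discarded, no duplicate after replay
print(sink.committed)           # ['a', 'b']
```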

Generally, I would advise always attaching meaningful timestamps to your
records, so that you can use watermarking [1] to trade off latency against
completeness. The timestamps can also be used to identify late records
(resulting from catch-up after recovery), which downstream jobs should
ignore.
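As a minimal sketch of that idea (plain Python, not Flink's watermark API):
with bounded out-of-orderness, the watermark trails the maximum timestamp
seen so far by a fixed delay, and any record at or below the current
watermark is treated as late and dropped.

```python
# Sketch of event-time watermarking with bounded out-of-orderness.
# Illustrative only; Flink generates watermarks via WatermarkStrategy.

def filter_late(records, max_out_of_orderness):
    """records: iterable of (timestamp, payload); yields on-time records."""
    watermark = float("-inf")
    for ts, payload in records:
        if ts <= watermark:
            continue  # late record: the watermark has already passed it
        yield ts, payload
        # Watermark trails the max seen timestamp by the allowed delay.
        watermark = max(watermark, ts - max_out_of_orderness)

events = [(10, "a"), (14, "b"), (9, "c"), (20, "d"), (15, "e")]
# (9, "c") and (15, "e") arrive behind the watermark and are dropped.
print(list(filter_late(events, max_out_of_orderness=2)))
# [(10, 'a'), (14, 'b'), (20, 'd')]
```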

There are other users who assign a unique ID to every message going
through their system and only use idempotent operations (set operations)
within Flink, because messages are sometimes duplicated even before
reaching the stream processor. For downstream jobs whose upstream job
might duplicate records, this can be a viable, yet limiting, approach as
well.
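The idempotent-consumer pattern boils down to something like this sketch
(plain Python; in a real job the seen-set would live in keyed state,
ideally with a TTL so it does not grow without bound):

```python
# Sketch of deduplication by unique message ID: reprocessing a message
# that was already seen is a no-op, so upstream duplicates or replays
# do not corrupt downstream state. Illustrative only.

def dedupe(messages):
    seen = set()
    for msg_id, payload in messages:
        if msg_id in seen:
            continue  # duplicate delivery: applying it again changes nothing
        seen.add(msg_id)
        yield msg_id, payload

msgs = [("1", 100), ("2", 200), ("1", 100), ("3", 300)]
print(list(dedupe(msgs)))  # [('1', 100), ('2', 200), ('3', 300)]
```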

Hope this helps; let me know what you think.

Cheers,

Konstantin

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/event_time.html#event-time-and-watermarks

On Thu, May 30, 2019 at 11:39 AM 徐涛 <happydexutao@gmail.com> wrote:

> Hi Experts,
>         In batch computing, there are products like Azkaban or Airflow to
> manage batch job dependencies. By using a dependency management tool, we
> can build a large-scale system consisting of small jobs.
>         In stream processing, it is not practical to put all dependencies
> in one job, because it would make the job too complicated and the state
> too large. I want to build a large-scale realtime system consisting of
> many Kafka sources and many streaming jobs, but the first thing I have to
> figure out is how to build the dependencies and connections between
> streaming jobs.
>         The only method I can think of is a self-implemented retract
> Kafka sink, with each streaming job connected by a Kafka topic. But
> because each job may fail and retry, the messages in a Kafka topic may
> look like this:
>         { "retract": "false", "id": "1", "amount": 100 }
>         { "retract": "false", "id": "2", "amount": 200 }
>         { "retract": "true", "id": "1", "amount": 100 }
>         { "retract": "true", "id": "2", "amount": 200 }
>         { "retract": "false", "id": "1", "amount": 100 }
>         { "retract": "false", "id": "2", "amount": 200 }
>         If the topic is "topic_1", the SQL in the downstream job may look
> like this:
>                 select
>                         id, latest(amount)
>                 from topic_1
>                 where retract = "false"
>                 group by id
>
>         But this also creates big state, because every id is grouped.
>         I wonder whether using Kafka to connect streaming jobs is
> applicable, and how to build a large-scale realtime system consisting of
> many streaming jobs?
>
>         Thanks a lot.
>
> Best
> Henry
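For reference, the retract-stream semantics Henry describes above can be
simulated outside Flink in a few lines (plain Python sketch, not a retract
Kafka sink): a retract message cancels the earlier add for the same id, and
the reader keeps only the latest non-retracted amount per id, which is what
the `latest(amount) ... group by id` query computes.

```python
# Replay of the example topic: retracts cancel earlier adds, and the
# reader keeps the latest non-retracted amount per id. Illustrative only.

def latest_amounts(messages):
    state = {}
    for msg in messages:
        if msg["retract"] == "true":
            state.pop(msg["id"], None)   # retract cancels the earlier add
        else:
            state[msg["id"]] = msg["amount"]
    return state

topic_1 = [
    {"retract": "false", "id": "1", "amount": 100},
    {"retract": "false", "id": "2", "amount": 200},
    {"retract": "true",  "id": "1", "amount": 100},
    {"retract": "true",  "id": "2", "amount": 200},
    {"retract": "false", "id": "1", "amount": 100},
    {"retract": "false", "id": "2", "amount": 200},
]
print(latest_amounts(topic_1))  # {'1': 100, '2': 200}
```

Note the state here is one entry per id, which is exactly the "big state
because each id is being grouped" concern.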



-- 

Konstantin Knauf | Solutions Architect


Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
