flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yunus Olgun <yunol...@gmail.com>
Subject Synchronized Kafka sources
Date Fri, 04 Aug 2017 13:23:55 GMT
Hi,

Is it possible to synchronize two kafka sources? So they can consume from
different Kafka topics in close enough event times.

My use case is, I have two Kafka topics: A(very large) and B(large). There
is a mapping of one to one or zero between A and B. Topology is simply join
A and B in a tumbling time window and do aggregations on the joined data.

In real time, there is not a problem. But when I start the job for last
week it becomes very slow. Because, by the time source A consumes 1 minute
of data from Kafka, source B consumes 1 hour of data from Kafka. Since
watermark progresses with the smallest of the parent operators, source B
generates many windows that will stay in the memory to be triggered in the
future. That increases state size. Checkpoints gets bigger and bigger and
the job becomes slower.

I have tried to put an operator after sources which writes event times to
an external source. If a source is far ahead than the other one, it sleeps
for a short time then consume a little bit, then check and sleep again if
it is necessary. This map operator increased checkpoint times much higher.
I guess sleeping at an operator is not a good idea with checkpoint
mechanism.

Is there a way to make two or more sources consume in a synchonized way
from Kafka using Flink?

Mime
View raw message