flink-user mailing list archives

From Elias Levy <fearsome.lucid...@gmail.com>
Subject Re: Kafka consumer are too fast for some partitions in "flatMap" like jobs
Date Tue, 29 Aug 2017 16:10:33 GMT
How many partitions does the output topic have?  If it has the same number
of partitions as the input topic (30), have you considered simply using a
custom partitioner for the Kafka sink that uses the input partition number
as the output partition number?  If the input messages are ordered per
input partition, that would guarantee their order in the output partitions.
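
A minimal sketch of that partition-selection logic, in plain Java with no Flink dependency. The names `InputPartitionPassThrough` and `TaggedRecord` are hypothetical, and the sketch assumes each record has been tagged with the TOPIC_IN partition it was read from (for example by a deserialization schema that exposes the source partition). In a real job the same logic would sit inside the Kafka sink's custom partitioner, and the `partitions` array is assumed to list the TOPIC_OUT partition ids in ascending order.

```java
public class InputPartitionPassThrough {

    /** Hypothetical record type: the payload plus the TOPIC_IN partition it came from. */
    static final class TaggedRecord {
        final int inputPartition;
        final String payload;

        TaggedRecord(int inputPartition, String payload) {
            this.inputPartition = inputPartition;
            this.payload = payload;
        }
    }

    /**
     * Chooses the output partition for a record.
     * With equal partition counts this is the identity mapping, so every
     * input partition feeds exactly one output partition.
     */
    static int partition(TaggedRecord record, int[] partitions) {
        if (partitions == null || partitions.length == 0) {
            throw new IllegalArgumentException("target topic has no partitions");
        }
        if (record.inputPartition < 0) {
            throw new IllegalArgumentException("negative input partition");
        }
        // The modulo only matters if TOPIC_OUT has fewer partitions than TOPIC_IN;
        // in that case several input partitions share one output partition.
        return partitions[record.inputPartition % partitions.length];
    }

    public static void main(String[] args) {
        // Assumed layout: 30 output partitions with ids 0..29.
        int[] outPartitions = new int[30];
        for (int i = 0; i < outPartitions.length; i++) {
            outPartitions[i] = i;
        }
        TaggedRecord record = new TaggedRecord(7, "event-a");
        System.out.println(partition(record, outPartitions)); // prints 7
    }
}
```

With a 1:1 mapping, a slow input partition only delays its own output partition, so the order within each output partition matches the order within the corresponding input partition.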

On Tue, Aug 29, 2017 at 1:54 AM, Oleksandr Baliev <aleksanderbalev@gmail.com> wrote:

> Hello,
> There is a Flink job which consumes from a Kafka topic (TOPIC_IN), applies a
> simple flatMap / map to the data, and pushes it to another Kafka topic (TOPIC_OUT).
> TOPIC_IN has around 30 partitions, the data is more or less sequential per
> partition, and the job has parallelism 30. So in theory there should be a 1:1
> mapping between consumers and partitions.
> But we often see a big offset lag for some partitions. That should
> mean that some consumers are slower than others (e.g. because of
> network issues on a particular broker host, or anything else). So the data in
> the TOPIC_OUT partitions is distributed but not sequential at all.
> So when another Flink job consumes from TOPIC_OUT and uses
> BoundedOutOfOrdernessTimestampExtractor to generate watermarks, the
> differences in data timestamps can produce a lot of late data. Maybe
> something is missing in this setup, of course, or there is a better approach
> for such flatMap / map jobs.
> Setting a big WindowedStream#allowedLateness or giving more time to
> BoundedOutOfOrdernessTimestampExtractor will increase memory consumption
> and will probably cause other issues, and there can still be late data,
> which is not good for later windows.
> One solution is to have some shared place to synchronize the lowest
> timestamp between consumers and somehow slow down consumption (Thread
> sleep, wait, while loop with condition...).
> 0. Is there any good approach to handle such "Kafka <-  flatMap / map ->
> Kafka" tasks, so that data in TOPIC_OUT will be sequential as in TOPIC_IN?
> 1. As far as I can see, slow consumers should be a common problem
> for big Kafka topics with a lot of partitions, shouldn't they? How do Flink/Kafka
> handle it?
> 2. Does anybody know whether there is any mechanism in Flink / Kafka
> (backpressure?) that lets a child operator (some process function,
> for example) tell specific fast consumers to slow down a bit? Is something
> like a callback possible in Flink? I don't think so, but..?
> 3. Or is there in Flink already anything which can help to synchronize
> minimum timestamps between consumers and?
> 4. Is there any good approach to slow down consumption in the Kafka consumer?
> I think there would be some problems between the session timeout and poll, or
> something related to that, but maybe there is already some good solution
> for that :)
> I will be glad if somebody can give some hints on any of these questions,
> Best,
> Sasha
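
The late-data effect described in the quoted message can be illustrated with a small plain-Java simulation. This is only a sketch under simplifying assumptions: the class and method names are hypothetical, the watermark is recomputed after every element as the largest timestamp seen so far minus the out-of-orderness bound (Flink emits watermarks periodically rather than per element), and an element counts as late when its timestamp is at or below the current watermark. The timestamps are made up for illustration.

```java
public class LaggingPartitionLateness {

    /**
     * Counts elements whose timestamp is at or below the watermark in effect
     * when they arrive, where watermark = max timestamp seen - maxOutOfOrderness.
     */
    static int countBehindWatermark(long[] timestamps, long maxOutOfOrderness) {
        boolean seenAny = false;
        long maxSeen = 0L;
        int late = 0;
        for (long ts : timestamps) {
            // The first element can never be late: no watermark exists yet.
            if (seenAny && ts <= maxSeen - maxOutOfOrderness) {
                late++;
            }
            maxSeen = seenAny ? Math.max(maxSeen, ts) : ts;
            seenAny = true;
        }
        return late;
    }

    public static void main(String[] args) {
        // Illustrative data: one producer is about 900 ms ahead of the other,
        // and their records end up interleaved in the same output partition.
        long[] interleaved = {1000, 100, 1100, 200, 1200, 300};

        // A bound smaller than the lag marks the slow producer's records as late.
        System.out.println(countBehindWatermark(interleaved, 500));  // prints 3

        // A bound larger than the lag avoids late data, at the cost of holding
        // window state longer.
        System.out.println(countBehindWatermark(interleaved, 2000)); // prints 0
    }
}
```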
