flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Metzger <rmetz...@apache.org>
Subject Re: Flink Kafka connector consume from a single kafka partition
Date Fri, 21 Feb 2020 11:17:20 GMT
Hey Hemant,

Are you able to reconstruct the ordering of the event, for example based on
time or some sequence number?
If so, you could create as many Kafka partitions as you need (for proper
load distribution), disregarding any ordering at that point.
Then you keyBy your stream in Flink, and order it within a window operator
(or some custom logic in a process function)
Flink is able to handle quite large states using the RocksDB statebackend.


On Wed, Feb 19, 2020 at 6:34 PM hemant singh <hemant2184@gmail.com> wrote:

> Hi Arvid,
> Thanks for your response. I think I did not word my question properly.
> I wanted to confirm that if the data is distributed to more than one
> partition then the ordering cannot be maintained (which is documented).
> According to your response I understand if I set the parallelism to number
> of partition then each consumer will consume from one partition and
> ordering can be maintained.
> However, I have a question here in case my parallelism is less than number
> of partitions still I believe if I create keyedstream ordering will be
> maintained at operator level for that key. Correct me if I am wrong.
> Second, one issue/challenge which I see with this model is one of the
> source's frequency of pushing data is very high then one partition is
> overloaded. Hence the task which process this will be overloaded too,
> however for maintaining ordering I do not have any other options but to
> maintain data in one partition.
> Thanks,
> Hemant
> On Wed, Feb 19, 2020 at 5:54 PM Arvid Heise <arvid@ververica.com> wrote:
>> Hi Hemant,
>> Flink passes your configurations to the Kafka consumer, so you could
>> check if you can subscribe to only one partition there.
>> However, I would discourage that approach. I don't see the benefit to
>> just subscribing to the topic entirely and have dedicated processing for
>> the different devices.
>> If you are concerned about the order, you shouldn't. Since all events of
>> a specific device-id reside in the same source partition, events are
>> in-order in Kafka (responsibility of producer, but I'm assuming that
>> because of your mail) and thus they are also in order in non-keyed streams
>> in Flink. Any keyBy on device-id or composite key involving device-id,
>> would also retain the order.
>> If you have exactly one partition per device-id, you could even go with
>> `DataStreamUtil#reinterpretAsKeyedStream` to avoid any shuffling.
>> Let me know if I misunderstood your use case or if you have further
>> questions.
>> Best,
>> Arvid
>> On Wed, Feb 19, 2020 at 8:39 AM hemant singh <hemant2184@gmail.com>
>> wrote:
>>> Hello Flink Users,
>>> I have a use case where I am processing metrics from different type of
>>> sources(one source will have multiple devices) and for aggregations as well
>>> as build alerts order of messages is important. To maintain customer data
>>> segregation I plan to have single topic for each customer with each source
>>> stream data to one kafka partition.
>>> To maintain ordering I am planning to push data for a single source type
>>> to single partitions. Then I can create keyedstream so that each of the
>>> device-id I have a single stream which has ordered data for each device-id.
>>> However, flink-kafka consumer I don't see that I can read from a
>>> specific partition hence flink consumer read from multiple kafka
>>> partitions. So even if I try to create a keyedstream on source type(and
>>> then write to a partition for further processing like keyedstream on
>>> device-id) I think ordering will not be maintained per source type.
>>> Only other option I feel I am left with is have single partition for the
>>> topic so that flink can subscribe to the topic and this maintains the
>>> ordering, the challenge is too many topics(as I have this configuration for
>>> multiple customers) which is not advisable for a kafka cluster.
>>> Can anyone shed some light on how to handle this use case.
>>> Thanks,
>>> Hemant

View raw message