flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sameer W <sam...@axiomine.com>
Subject Re: Multiple Partitions (Source Functions) -> Event Time -> Watermarks -> Trigger
Date Wed, 10 Aug 2016 23:05:41 GMT
Sorry for replying to my own messages but this is super confusing and
logical at the same time to me :-).

If I have Kafka Topic with 10 partitions. If I partition by device id when
I write to the Topic, and use Event Time, my pipeline freezes (if fewer
than 10 devices are active initially). Because if some partitions are
inactive (only a few devices active at a time) they do not send watermarks
and my pipeline waits forever for those partitions to send in their
watermarks even if the keyBy is on the device id whose records are going to
come from only one partition.

When I send records to Kafka randomly (to any partition) the pipeline works
fine as all partitions (sources connected to them) are sending watermarks.

This gets even more confusing if I apply watermarks and timestamps
downstream after a KeyBy operation which is again followed by another keyBy
which does not receive events for a key from all the upstream operators.
Again nothing fires as Flink expects other map operators (to which the
watermark assignment is piped) to send in the watermarks as well.

My conclusion: Only produce watermarks at the source function. Is this
valid or am I missing something? Because only when I do that (and random
allocation of events to partitions in Kafka) the whole pipeline works

*If there a way to set a timeout - If watermarks from source functions are
not received within a certain time interval, fire the time windows.*


On Wed, Aug 10, 2016 at 3:27 PM, Sameer W <sameer@axiomine.com> wrote:

> And this is happening in my local environment. As soon as I set the
> parallelism to 1 it all works fine.
> Sameer
> On Wed, Aug 10, 2016 at 3:11 PM, Sameer W <sameer@axiomine.com> wrote:
>> Hi,
>> I am noticing this behavior with Event Time processing-
>> I have a Kafka topic with 10 partitions. Each Event Source sends data to
>> any one of the partitions. Say I have only 1 event source active at this
>> moment, which means only one partition is receiving data.
>> None of my windows will fire now because the 9 partitions (source
>> function instances) are not sending any watermarks and Flink waits forever.
>> I go to topic with 1 partition but leave default parallelism intact. Only
>> one Mapper instance contributes to the subsequent keyBy operation but other
>> 7 (assuming 8 of default parallelism) are idle. I assign watermarks after
>> the map function. Again the same behavior because the 7 other mappers are
>> not sending watermarks.
>> How do I handle this? Not all of my partitions are going to be receiving
>> data at all times using this partitioning strategy. Or I have to use random
>> partitioning which will also work.
>> Thanks,
>> Sameer

View raw message