flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ray Ruvinskiy <ray.ruvins...@arcticwolf.com>
Subject Re: At what point do watermarks get injected into the stream?
Date Sun, 11 Jun 2017 15:22:23 GMT
Thanks for the explanation, Fabian.

Suppose I have a parallel source that does not inject watermarks, and the first operation
on the DataStream is assignTimestampsAndWatermarks. Does each parallel task that makes up
the source independently inject watermarks for the records that it has read? Suppose I then
call keyBy and a shuffle ensues. Will the resulting partitions after the shuffle have interleaved
watermarks from the various source tasks?

More concretely, suppose s source has a degree of parallelism of two. One of the source tasks
injects the watermarks 2 and 5, while the other injects 3 and 10. There is then a shuffle,
creating two different partitions. Will all the watermarks be broadcast to all the partitions?
Or is it possible for, say, one partition to end up with watermarks 2 and 10 and another with
3 and 5? And after the shuffle, how do we ensure that the watermarks are processed in order
by the operators receiving them?

Thanks,

Ray

From: Fabian Hueske <fhueske@gmail.com>
Date: Saturday, June 10, 2017 at 3:56 PM
To: Ray Ruvinskiy <ray.ruvinskiy@arcticwolf.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: At what point do watermarks get injected into the stream?

Hi Ray,
in principle, watermarks can be injected anywhere in a stream by calling DataStream.assignTimestampsAndWatermarks().
However, timestamps are usually injected as soon as possible after a stream in ingested (before
the first shuffle). The reason is that watermarks depend on the order of events (and their
timestamps) in the stream. While Flink guarantees the order of events within a partition,
a shuffle interleaves events of different partitions in an unpredictable way such that it
is not possible to reason about the order of timestamps afterwards.
The most common way to inject watermarks is directly inside of a SourceFunction or with a
TimestampAssigner before the first shuffle.
Best, Fabian

2017-06-09 0:46 GMT+02:00 Ray Ruvinskiy <ray.ruvinskiy@arcticwolf.com<mailto:ray.ruvinskiy@arcticwolf.com>>:
I’m trying to build a mental model of how watermarks get injected into the stream. Suppose
I have a stream with a parallel source, and I’m running a cluster with multiple task managers.
Does each parallel source reader inject watermarks, which are then forwarded to downstream
consumers and shuffled between task managers? Or are watermarks created after the shuffle,
when the stream records reach their destined task manager and right before they’re processed
by the operator?

Thanks,

Ray

Mime
View raw message