From: Yunus Olgun <yunolgun@gmail.com>
Subject: Re: CustomPartitioner that simulates ForwardPartitioner and watermarks
Date: Wed, 27 Sep 2017 18:32:20 +0200
To: Kostas Kloudas <k.kloudas@data-artisans.com>
Cc: user@flink.apache.org

Hi Kostas,

Yes, you have summarized it well. I want to forward the data only to the next local operator, but broadcast the watermark through the cluster.

- I can't set the parallelism of taskB to 1. The stream is too big for that. Also, the data is ordered within each partition, and I don't want to change that order.

- I don't need a KeyedStream. taskA and taskB will always have the same parallelism as each other, but this parallelism may be increased in the future.

The use case is: the source is Kafka. At our peak hours, or when we want to run the streaming job on old data from Kafka, the same thing always happens, even in trivial jobs. Some consumers consume faster than others. They produce too much data downstream, but the watermark advances slowly, at the speed of the slowest consumer. This extra data piles up at the downstream operators. When the downstream operator is an aggregation, that is fine. But when it is an in-Flink join, the state size gets too big, checkpoints take much longer, and the job overall becomes slower or fails. It also affects other jobs on the cluster.

So, basically I want to implement a throttler. It compares the timestamp of a record with the global watermark. If the difference is larger than a constant threshold, it starts sleeping 1 ms for each incoming record. This way, fast operators wait for the slowest one.
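Roughly, the idea in code would be a simplified sketch like the one below (the class name and threshold parameter are illustrative, and it assumes event timestamps are assigned upstream so that both the element timestamp and the current watermark are available):

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Sketch of the throttling idea, not production code.
public class ThrottlingFunction<T> extends ProcessFunction<T, T> {

    private final long maxAheadMillis;  // how far a record may run ahead of the watermark

    public ThrottlingFunction(long maxAheadMillis) {
        this.maxAheadMillis = maxAheadMillis;
    }

    @Override
    public void processElement(T value, Context ctx, Collector<T> out) throws Exception {
        Long timestamp = ctx.timestamp();                        // event timestamp, if assigned
        long watermark = ctx.timerService().currentWatermark();  // watermark seen by this subtask
        // If this subtask runs too far ahead of the watermark, back off 1 ms per record
        // so that the slowest upstream partition can catch up.
        if (timestamp != null && timestamp - watermark > maxAheadMillis) {
            Thread.sleep(1);
        }
        out.collect(value);
    }
}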

The only problem is that this solution comes at the cost of one network shuffle and data serialization/deserialization. Since the stream is large, I want to at least avoid the network shuffle.

I thought operator instances within a taskmanager would get the same indexId, but apparently this is not the case.

Thanks,

On 27 Sep 2017, at 17:16, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:

Hi Yunus,

I am not sure if I understand the question correctly.

Am I correct to assume that you want the following?

  ---------------------------------------------------------> time

  ProcessA                             ProcessB

  Task1:  W(3) E(1) E(2)  E(5)         W(3) W(7) E(1) E(2)  E(5)

  Task2:  W(7) E(3) E(10) E(6)         W(3) W(7) E(3) E(10) E(6)


In the above, elements flow from left to right; W() stands for a watermark and E() for an element.
In other words, between Process(TaskA) and Process(TaskB) you want to forward only the elements, but broadcast the watermarks, right?

If this is the case, a trivial solution would be to set the parallelism of TaskB to 1, so that all elements go through the same node.
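In DataStream terms that is simply the following (a sketch, reusing the source/taskA/taskB names from your snippet below):

source.process(taskA)
      .process(taskB)
      .setParallelism(1);   // applies to taskB: a single subtask sees all elements and watermarks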

Another solution is what you did, BUT with a custom partitioner you cannot use keyed state in your process function B, because the stream is no longer keyed.

A similar approach to what you did, but without the limitation above: in the first processFunction (TaskA) you can append the taskId to the elements themselves and then do a keyBy(taskId) between the first and the second process function.
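In code this could look roughly like the sketch below (untested; MyEvent stands in for your element type, a Tuple2 is used as the carrier for the taskId, and taskB then receives the tagged tuples):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// taskA tags each element with the index of the subtask that produced it.
DataStream<Tuple2<Integer, MyEvent>> tagged = source
    .process(new ProcessFunction<MyEvent, Tuple2<Integer, MyEvent>>() {
        @Override
        public void processElement(MyEvent value, Context ctx,
                                   Collector<Tuple2<Integer, MyEvent>> out) {
            out.collect(Tuple2.of(getRuntimeContext().getIndexOfThisSubtask(), value));
        }
    });

// Keying by the attached taskId keeps the stream keyed, so keyed state works downstream.
tagged.keyBy(0)
      .process(taskB);   // taskB now receives Tuple2<Integer, MyEvent>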

These are the solutions that I can come up with, assuming that you want to do what I described.

But in general, could you please describe your use case in a bit more detail?
This way we may figure out another approach to achieve your goal.
In fact, I am not sure that you gain anything by broadcasting the watermark, other than
re-implementing (to some extent) Flink's windowing mechanism.

Thanks,
Kostas

On Sep 27, 2017, at 4:35 PM, Yunus Olgun <yunolgun@gmail.com> wrote:

Hi,

I have a simple streaming job such as:

source.process(taskA)
      .process(taskB)

I want taskB to access the minimum watermark of all parallel taskA instances, but the data is ordered and should not be shuffled. ForwardPartitioner uses the watermark of only one predecessor, so I have used a custom partitioner.

source.process(taskA)
      .map(AssignPartitionID)
      .partitionCustom(IdPartitioner)
      .map(StripPartitionID)
      .process(taskB)

In the AssignPartitionID function, I attach getRuntimeContext().getIndexOfThisSubtask() as a partitionId to the object. In IdPartitioner, I return this partitionId.
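A simplified sketch of what these two helpers amount to (illustrative types and names, with a Tuple2 as the carrier):

import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// Tags each record with the index of the taskA subtask that produced it.
public class AssignPartitionID<T> extends RichMapFunction<T, Tuple2<Integer, T>> {
    @Override
    public Tuple2<Integer, T> map(T value) {
        return Tuple2.of(getRuntimeContext().getIndexOfThisSubtask(), value);
    }
}

// Sends each record to the downstream channel with the same index.
// This relies on taskA and taskB having the same parallelism.
public class IdPartitioner implements Partitioner<Integer> {
    @Override
    public int partition(Integer key, int numPartitions) {
        return key;
    }
}

They are wired up with partitionCustom(new IdPartitioner(), 0), i.e. using field 0 (the attached id) as the key; depending on the element type, Flink may also need a type hint via returns() for the generic Tuple2.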

This solved the main requirement, but I have another concern now:

Network shuffle: I don't need a network shuffle. I thought that within a taskmanager the indexId of taskA subtasks would be the same as the indexId of taskB subtasks. Unfortunately, they are not. Is there a way to make partitionCustom distribute data like ForwardPartitioner, to the next local operator?

As far as I know, this still requires object serialization/deserialization since the operators can't be chained anymore. Is there a way to get the minimum watermark from upstream operators without a network shuffle and object serialization/deserialization?

Regards,

