From: Robert Metzger <rmetzger@apache.org>
To: user@flink.apache.org
Date: Fri, 22 Jan 2016 14:34:42 +0100
Subject: Re: Backpressure in the context of JDBCOutputFormat update

Hi,

have you thought about making two independent jobs out of this (or calling execute() separately for the two parts)?
One job for the update() and one for the insert()?

Even though the update operation should not be expensive, I think it's helpful to understand the performance impact of having concurrent inserts/updates versus executing these operations sequentially.
Are the inserts/updates performed on the same table?
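
As a rough, untested sketch of the two-job variant: the items table, the queries, and the connection settings below are made up, and the JDBCOutputFormat builder calls are written against the Tuple-based flink-jdbc connector from the 0.10 line, so please double-check them against your setup.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.jdbc.JDBCOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

public class SequentialInsertThenUpdate {

    // Placeholder connection settings -- replace with the real driver, URL and credentials.
    private static final String DRIVER = "org.postgresql.Driver";
    private static final String DB_URL = "jdbc:postgresql://localhost:5432/testdb";

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Job 1: only the insert sink. execute() blocks until all inserts are written.
        // Tuple order is (payload, id) so the fields line up with the '?' placeholders below.
        DataSet<Tuple2<String, Integer>> insertRows = env.fromElements(
                new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("c", 3));
        insertRows
                .distinct() // needed because of the unique key constraint on the table
                .output(JDBCOutputFormat.buildJDBCOutputFormat()
                        .setDrivername(DRIVER)
                        .setDBUrl(DB_URL)
                        .setQuery("INSERT INTO items (payload, id) VALUES (?, ?)")
                        .finish());
        env.execute("insert job");

        // Job 2: only the update sink, submitted after the insert job has finished,
        // so inserts and updates never hit the table concurrently.
        DataSet<Tuple2<String, Integer>> updateRows = env.fromElements(
                new Tuple2<>("a2", 1), new Tuple2<>("b2", 2), new Tuple2<>("c2", 3));
        updateRows.output(JDBCOutputFormat.buildJDBCOutputFormat()
                .setDrivername(DRIVER)
                .setDBUrl(DB_URL)
                .setQuery("UPDATE items SET payload = ? WHERE id = ?")
                .finish());
        env.execute("update job");
    }
}

Since the second execute() is only submitted after the first one has returned, the inserts are committed before any update statement runs, so the database never sees the two statement types at the same time.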





On Thu, Jan 21, 2016 at 4:17 PM, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
Hi Robert,
sorry, I should have been clearer in my initial mail. The two cases I was comparing are:

1) distinct() before insert (which is necessary as we have a unique key constraint in our database), no distinct() before update
2) distinct() before insert AND distinct() before update

The test data used actually only contains unique values for the affected field though, so the dataset size is not reduced in case 2.

In case 1 the insert does not start until all the data has arrived at distinct(), while the update is already going along (slowing down upstream operators as well). In case 2 both sinks wait for their respective distinct()'s (which are reached much faster now), then start roughly at the same time, leading to a shorter net job time for case 2 as compared to case 1.
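
To make the two cases concrete, the wiring looks roughly like this. This is an untested sketch: the items table, the queries, and the connection settings are placeholders, and the JDBCOutputFormat calls assume the Tuple-based 0.10 flink-jdbc connector.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.jdbc.JDBCOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

public class DistinctPlacementCases {

    public static void main(String[] args) throws Exception {
        // Pass "case2" as the first argument to run case 2, otherwise case 1 is run.
        boolean distinctBeforeUpdate = args.length > 0 && "case2".equals(args[0]);

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; in the real job this is the output of the upstream operators.
        // Tuple order is (payload, id) so the fields line up with the '?' placeholders below.
        DataSet<Tuple2<String, Integer>> rows = env.fromElements(
                new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("c", 3));

        // Insert sink: distinct() is always present because of the unique key constraint.
        rows.distinct()
            .output(jdbcSink("INSERT INTO items (payload, id) VALUES (?, ?)"));

        // Update sink: case 1 attaches it directly to 'rows', case 2 puts a distinct()
        // in front of it, so the sink only starts once distinct() has seen all data.
        DataSet<Tuple2<String, Integer>> updateInput = rows;
        if (distinctBeforeUpdate) {
            updateInput = updateInput.distinct();
        }
        updateInput.output(jdbcSink("UPDATE items SET payload = ? WHERE id = ?"));

        env.execute(distinctBeforeUpdate ? "case 2" : "case 1");
    }

    // Builds a JDBC sink for the given statement; driver and URL are placeholders.
    private static JDBCOutputFormat jdbcSink(String query) {
        return JDBCOutputFormat.buildJDBCOutputFormat()
                .setDrivername("org.postgresql.Driver")
                .setDBUrl("jdbc:postgresql://localhost:5432/testdb")
                .setQuery(query)
                .finish();
    }
}

The only difference between the two runs is the extra distinct() in front of the update sink.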

A pause operator might be useful, yes.

The update should not be an inherently much more expensive operation, as the WHERE clause only contains the table's primary key.

Cheers,
Max
—
Maximilian Bode * Junior Consultant * maximilian.bode@tngtech.com * 0176 1000 75 50
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Registered office: Unterföhring * Amtsgericht München * HRB 135082

On 21.01.2016 at 15:57, Robert Metzger <rmetzger@apache.org> wrote:

Hi Max,

is the distinct() operation reducing the size of the DataSet? If so, I assume you have an idempotent update and the job is faster because fewer updates are done?
if the distinct() operator is not changing anything, then the job might be faster because the INSERT is done while Flink is still executing the distinct() operation. So the insert is already over when the updates are starting. This would mean that concurrent inserts and updates on the database are much slower than doing them sequentially.

I'm wondering if there is a way in Flink to explicitly ask for spilling an intermediate operator to "pause" execution:

Source ----- > (spill for pausing) ---> (update sink)
        \
         ------- > (insert)

I don't have a lot of practical experience with RDBMS, but I guess updates are slower because an index lookup + update is necessary. Maybe optimizing the database configuration / schema / indexes is more promising. I think it's indeed much nicer to avoid any unnecessary steps in Flink.
Did you do any "microbenchmarks" for the update an= d insert part? I guess that would help a lot to understand the impact of ce= rtain index structures, batching sizes, or database drivers.

Regards,
Robert




On Thu, Jan 21, 2016 at 3:35 PM, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
Hi everyone,

in a Flink (0.10.1) job with two JDBCOutputFormat sinks, one of them (doing a database update) is performing slower than the other one (an insert). The job as a whole is also slow as upstream operators are slowed down due to backpressure. I am able to speed up the whole job by introducing an a priori unnecessary .distinct(), which of course blocks downstream execution of the slow sink, which in turn seems to be able to execute faster when given all data at once.

Any ideas what is going on here? Is there something I can do without introducing unnecessary computation steps?

Cheers,
Max
—
Maximilian Bode * Junior Consultant * maximilian.bode@tngtech.com * 0176 1000 75 50
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Registered office: Unterföhring * Amtsgericht München * HRB 135082


