From: Jason Brelloch
Date: Thu, 28 Jul 2016 13:57:39 -0400
Subject: Re: Reprocessing data in Flink / rebuilding Flink state
To: user@flink.apache.org

Hey Josh,

The way we replay historical data is that we have a second Flink job which listens to the same live stream and stores every single event in Google Cloud Storage.

When the main Flink job that is processing the live stream gets a request for a specific data set that it has not been processing yet, it sends a request to the historical Flink job for the old data. The live job then starts storing relevant events from the live stream in state. It continues storing the live events until all the events from the historical job have been processed, then it processes the stored events, and finally starts processing the live stream again.

As long as it's properly keyed (we key on the specific data set), it doesn't block anything, keeps everything ordered, and eventually catches up. It also allows us to completely blow away state and rebuild it from scratch.

So in your case it looks like what you could do is send a request to the "historical" job whenever you get an item that you don't yet have the current state of.

The potential problems you may have are that it may not be possible to store every single historical event, and that you need to make sure there is enough memory to handle the ever-increasing state size while the historical events are being replayed (and make sure to clear the state when it is done).

It's a little complicated, and pretty expensive, but it works. Let me know if something doesn't make sense.
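Roughly, the catch-up logic in the live job looks like the sketch below: the replayed historical stream and the live stream are connected and keyed by data set, and live events are buffered in keyed state until the replay for that key has finished. This is only a minimal illustration of the pattern, not our actual code - Event, isEndOfHistory(), and the end-of-history marker are made-up placeholders.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Input 1: events replayed by the historical job, ending with a marker event.
// Input 2: the live stream. Both inputs are keyed by the data set id.
public class ReplayThenLiveFunction extends RichCoFlatMapFunction<Event, Event, Event> {

    private transient ValueState<Boolean> historyDone; // has the replay finished for this key?
    private transient ListState<Event> bufferedLive;   // live events held back during the replay

    @Override
    public void open(Configuration parameters) {
        historyDone = getRuntimeContext().getState(
                new ValueStateDescriptor<>("historyDone", Boolean.class));
        bufferedLive = getRuntimeContext().getListState(
                new ListStateDescriptor<>("bufferedLive", Event.class));
    }

    @Override
    public void flatMap1(Event historical, Collector<Event> out) throws Exception {
        if (historical.isEndOfHistory()) {
            // Replay finished for this key: flush whatever arrived on the live stream meanwhile.
            Iterable<Event> buffered = bufferedLive.get();
            if (buffered != null) {
                for (Event event : buffered) {
                    out.collect(event);
                }
            }
            bufferedLive.clear();    // clear the buffer so state size stops growing
            historyDone.update(true);
        } else {
            out.collect(historical); // old events are processed first, in order
        }
    }

    @Override
    public void flatMap2(Event live, Collector<Event> out) throws Exception {
        if (Boolean.TRUE.equals(historyDone.value())) {
            out.collect(live);      // caught up: process live events directly
        } else {
            bufferedLive.add(live); // still replaying: hold the live event back
        }
    }
}

It would be wired up along the lines of historicalStream.connect(liveStream).keyBy(h -> h.getDataSetId(), l -> l.getDataSetId()).flatMap(new ReplayThenLiveFunction()) (again, getDataSetId() is a placeholder), so buffering happens per key and events for other keys keep flowing.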
On Thu, Jul 28, 2016 at 1:14 PM, Josh <jofo90@gmail.com> wrote:

> Hi all,
>
> I was wondering what approaches people usually take with reprocessing data
> with Flink - specifically the case where you want to upgrade a Flink job,
> and make it reprocess historical data before continuing to process a live
> stream.
>
> I'm wondering if we can do something similar to the 'simple rewind' or
> 'parallel rewind' which Samza uses to solve this problem, discussed here:
> https://samza.apache.org/learn/documentation/0.10/jobs/reprocessing.html
>
> Having used Flink over the past couple of months, the main issue I've had
> involves Flink's internal state - from my experience it seems it is easy to
> break the state when upgrading a job, or when changing the parallelism of
> operators, plus there's no easy way to view/access an internal key-value
> state from outside Flink.
>
> For an example of what I mean, consider a Flink job which consumes a
> stream of 'updates' to items, and maintains a key-value store of items
> within Flink's internal state (e.g. in RocksDB). The job also writes the
> updated items to a Kafka topic:
>
> http://oi64.tinypic.com/34q5opf.jpg
>
> My worry with this is that the state in RocksDB could be lost or become
> incompatible with an updated version of the job. If this happens, we need
> to be able to rebuild Flink's internal key-value store in RocksDB. So I'd
> like to be able to do something like this (which I believe is the Samza
> solution):
>
> http://oi67.tinypic.com/219ri95.jpg
>
> Has anyone done something like this already with Flink? If so are there
> any examples of how to do this replay & switchover (rebuild state by
> consuming from a historical log, then switch over to processing the live
> stream)?
>
> Thanks for any insights,
> Josh

--
Jason Brelloch | Product Developer
3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305
Subscribe to the BetterCloud Monitor - Get IT delivered to your inbox