Subject: Re: Combining streams with static data and using REST API as a sink
From: Josh
To: user@flink.apache.org
Date: Mon, 23 May 2016 18:36:27 +0100

Hi Max,

Thanks, that's very helpful re the REST API sink. For now I don't need exactly-once guarantees for the sink, so I'll just write a simple HTTP sink implementation. But I may need to move to the idempotent version in future!
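Something like this is what I have in mind for the sink, just as a first pass (the endpoint and JSON encoding are placeholders, and I know HttpURLConnection can't do a real PATCH):

    import org.apache.flink.streaming.api.functions.sink.SinkFunction;

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Fire-and-forget HTTP sink: no retries and no delivery guarantees.
    public class HttpSink implements SinkFunction<String> {

        // Placeholder endpoint; would come from config in a real job.
        private static final String ENDPOINT = "http://example.com/api/counts";

        @Override
        public void invoke(String json) throws Exception {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(ENDPOINT).openConnection();
            // HttpURLConnection has no PATCH support, so this uses POST;
            // a real PATCH would need an HTTP client library.
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes("UTF-8"));
            }
            if (conn.getResponseCode() >= 300) {
                throw new RuntimeException("Request failed: " + conn.getResponseCode());
            }
            conn.disconnect();
        }
    }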

For 1), that sounds like a simple solution, but how would I handle occasional updates in that case, since I guess the open() function is only called once? Do I need to periodically restart the job, or periodically trigger tasks to restart and refresh their data? Ideally I would want this job to be running constantly.
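One idea I had, to avoid restarting anything: run a refresh thread inside the function itself. Very roughly (just a fragment; ItemsClient is a made-up stand-in for the postgres/REST lookup):

    // Fragment of a RichFunction -- the rest looks like a normal
    // enrichment function. ItemsClient is hypothetical.
    private transient volatile Map<String, Item> items;
    private transient Thread refresher;

    @Override
    public void open(Configuration parameters) throws Exception {
        items = ItemsClient.fetchAll();
        refresher = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(60 * 60 * 1000);     // refresh hourly
                    items = ItemsClient.fetchAll();   // swap in a fresh snapshot
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        refresher.setDaemon(true);
        refresher.start();
    }

    @Override
    public void close() {
        if (refresher != null) {
            refresher.interrupt();
        }
    }

Does that seem like a reasonable approach?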

Josh

On Mon, May 23, 2016 at 5:56 PM, Maximilian Michels <mxm@apache.org> wrote:
Hi Josh,

1) Use a RichFunction which has an `open()` method to load data (e.g. from a database) at runtime before the processing starts.
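A minimal sketch of that pattern (Order/Item and the loadItems() query are placeholders for your own types and lookup):

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;

    import java.util.HashMap;
    import java.util.Map;

    // Enriches each order with its item, loaded once before processing starts.
    public class EnrichOrders extends RichMapFunction<Order, Tuple2<Order, Item>> {

        private transient Map<String, Item> items;

        @Override
        public void open(Configuration parameters) throws Exception {
            // Called once per task instance, before any elements arrive.
            items = loadItems();
        }

        @Override
        public Tuple2<Order, Item> map(Order order) throws Exception {
            return new Tuple2<>(order, items.get(order.getItemId()));
        }

        private Map<String, Item> loadItems() throws Exception {
            Map<String, Item> map = new HashMap<>();
            // placeholder: fill the map with a JDBC query or REST call
            return map;
        }
    }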

2) No, that's fine. If you want your Rest API Sink to interplay with checkpointing (for fault-tolerance), this is a bit tricky though, depending on the guarantees you want to have. Typically, you would have "at least once" or "exactly once" semantics on the state. In Flink, this is easy to achieve; it's a bit harder for outside systems.

"At Least Once&qu= ot;

For example, if you increment a counter in a database, this count will be off if you recover your job in the case of a failure. You can checkpoint the current value of the counter and restore this value on a failure (using the Checkpointed interface). However, your counter might decrease temporarily when you resume from a checkpoint (until the counter has caught up again).
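For example, a counter sink along those lines (a sketch only; updateRemoteCounter() is a placeholder for your REST/DB call):

    import org.apache.flink.streaming.api.checkpoint.Checkpointed;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    // Counter sink with "at least once" semantics: the counter itself is
    // checkpointed, but the external write may be repeated after a failure.
    public class CountingSink extends RichSinkFunction<Long>
            implements Checkpointed<Long> {

        private long count = 0;

        @Override
        public void invoke(Long value) throws Exception {
            count += value;
            // Pushed on every element -- may run again after recovery,
            // which is where the temporary over/under-count comes from.
            updateRemoteCounter(count);
        }

        @Override
        public Long snapshotState(long checkpointId, long checkpointTimestamp) {
            return count;
        }

        @Override
        public void restoreState(Long state) {
            count = state;
        }

        private void updateRemoteCounter(long c) {
            // placeholder: e.g. HTTP request or SQL UPDATE
        }
    }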

"Exactly Once"

If you want "exactly once" semantics on outside systems (e.g. Rest API), you'll need idempotent updates. An idempotent variant of this would be a count with a checkpoint id (cid) in your database.

| cid | count |
|-----+-------|
|   0 |     3 |
|   1 |    11 |
|   2 |    20 |
|   3 |   120 |
|   4 |   137 |
|   5 |   158 |

You would then always read the newest cid value for presentation. You would only write to the database once you know you have completed the checkpoint (CheckpointListener). You can still fail while doing that, so you need to keep the confirmation around in the checkpoint such that you can confirm again after restore. It is important that confirmation can be done multiple times without affecting the result (idempotent). On recovery from a checkpoint, you want to delete all rows with a cid higher than the one you resume from. For example, if you fail after checkpoint 3 has been created, you'll confirm 3 (because you might have failed before you could confirm) and then delete 4 and 5 before starting the computation again.
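A rough sketch of such a sink (exec() stands in for real JDBC statements against a table like counts(cid BIGINT PRIMARY KEY, count BIGINT); it also simplifies by assuming checkpoints complete one at a time):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.runtime.state.CheckpointListener;
    import org.apache.flink.streaming.api.checkpoint.Checkpointed;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    public class IdempotentCountSink extends RichSinkFunction<Long>
            implements Checkpointed<Tuple2<Long, Long>>, CheckpointListener {

        private long count = 0;
        private long countAtSnapshot = 0;

        @Override
        public void invoke(Long value) {
            count += value;   // only touch internal state per element
        }

        @Override
        public Tuple2<Long, Long> snapshotState(long checkpointId, long timestamp) {
            countAtSnapshot = count;
            // The (cid, count) pair travels inside the checkpoint, so we
            // can confirm again after a restore.
            return new Tuple2<>(checkpointId, count);
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {
            confirm(checkpointId, countAtSnapshot);
        }

        @Override
        public void restoreState(Tuple2<Long, Long> state) {
            count = state.f1;
            // We may have died before confirming, so confirm again
            // (idempotent) ...
            confirm(state.f0, state.f1);
            // ... and drop rows from checkpoints that never completed.
            exec("DELETE FROM counts WHERE cid > " + state.f0);
        }

        private void confirm(long cid, long c) {
            // Idempotent upsert (postgres 9.5+ syntax).
            exec("INSERT INTO counts (cid, count) VALUES (" + cid + ", " + c + ")"
                    + " ON CONFLICT (cid) DO UPDATE SET count = EXCLUDED.count");
        }

        private void exec(String sql) {
            // placeholder for a JDBC statement
        }
    }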

You see that strong consistency guarantees can be a bit tricky. If you don't need strong guarantees and undercounting is ok for you, implement simple checkpointing for "at least once" using the Checkpointed interface, or the KeyValue state if your counter is scoped by key.

Cheers,
Max


On Mon, May 23, 2016 at 3:22 PM, Josh <jofo90@gmail.com> wrote:
> Hi all,
>
> I am new to Flink and have a couple of questions which I've had trouble
> finding answers to online. Any advice would be much appreciated!
>
> What's a typical way of handling the scenario where you want to join
> streaming data with a (relatively) static data source? For example, if I
> have a stream 'orders' where each order has an 'item_id', and I want to join
> this stream with my database of 'items'. The database of items is mostly
> static (with perhaps a few new items added every day). The database can be
> retrieved either directly from a standard SQL database (postgres) or via a
> REST call. I guess one way to handle this would be to distribute the
> database of items with the Flink tasks, and to redeploy the entire job if
> the items database changes. But I think there's probably a better way to do
> it?
>
> I'd like my Flink job to output state to a REST API (i.e. using the REST
> API as a sink). Updates would be incremental, e.g. the job would output
> tumbling window counts which need to be added to some property on a REST
> resource, so I'd probably implement this as a PATCH. I haven't found much
> evidence that anyone else has used a REST API as a Flink sink - is there a
> reason why this might be a bad idea?
>
> Thanks for any advice on these,
>
> Josh

