From: Aljoscha Krettek
Subject: Re: Stateful streaming question
Date: Thu, 15 Jun 2017 12:16:18 +0200
To: Flavio Pompermaier
Cc: Kostas Kloudas, Fabian Hüske, "Jain, Ankit", user@flink.apache.org

Hi,

Trying to revive this somewhat older thread: have you made any progress? I think going with a ProcessFunction that keeps all your state internally and periodically outputs to, say, Elasticsearch using a sink seems like the way to go. You can do the periodic emission using timers in the ProcessFunction.
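To make that concrete, here is a minimal sketch of the pattern, not a definitive implementation: MyGroupedObj, the raw Tuple4 input and the 10-minute interval are placeholders borrowed from the discussion below, and the API shown is roughly Flink 1.3's ProcessFunction.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Accumulates updates per key; emits only when a processing-time timer
// fires, not once per incoming element.
public class PeriodicEmitFunction extends ProcessFunction<Tuple4, MyGroupedObj> {

  private static final long INTERVAL = 10 * 60 * 1000L; // placeholder: 10 minutes

  private transient ValueState<MyGroupedObj> state;

  @Override
  public void open(Configuration config) {
    state = getRuntimeContext().getState(new ValueStateDescriptor<>(
        "grouped", TypeInformation.of(MyGroupedObj.class)));
  }

  @Override
  public void processElement(Tuple4 t, Context ctx, Collector<MyGroupedObj> out)
      throws Exception {
    MyGroupedObj current = state.value();
    if (current == null) {
      current = new MyGroupedObj();
      // first element for this key: schedule the first periodic emission
      ctx.timerService().registerProcessingTimeTimer(
          ctx.timerService().currentProcessingTime() + INTERVAL);
    }
    current.addTuple(t);
    state.update(current); // no out.collect() here: nothing leaves per update
  }

  @Override
  public void onTimer(long timestamp, OnTimerContext ctx, Collector<MyGroupedObj> out)
      throws Exception {
    if (state.value() != null) {
      out.collect(state.value()); // periodic snapshot, e.g. towards an Elasticsearch sink
    }
    ctx.timerService().registerProcessingTimeTimer(timestamp + INTERVAL); // re-arm
  }
}

This would be applied after keyBy(0), with the emitted stream going to the Elasticsearch sink.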

In your use case, does the data you would store in the Flink managed state have links between data of different keys? This sounds like it could be a problem when it comes to consistency when outputting to an external system.

Best,
Aljoscha
On 17 May 2017, at 14:12, Flavio Pompermaier <pompermaier@okkam.it> wrote:

Hi to all,
there are a lot of useful discussion points :)

I'll try to answer everybody.

@Ankit: 
  • right now we're using Parquet on HDFS to store thrift objects. Those objects are essentially structured like
    • key
    • alternative_key
    • list of tuples = (representing the state of my Object)
    • This model could potentially be modeled as a Monoid and it's very well suited for a stateful streaming computation, where updates to a single key's state are not as expensive as a call to a db to get the current list of tuples and write that list back for every update (IMHO). Maybe here I'm overestimating Flink's streaming capabilities...
  • serialization should be ok using thrift, but Flink advises using tuples for better performance, so just after reading the data from disk (as a ThriftObject) we convert it to its equivalent Tuple3<String, String, List<Tuple4>> representation
  • Since I currently use Flink to ingest data that (in the end) means adding tuples to my objects, it would be perfect to have an "online" state of the grouped tuples in order to:
    • add/remove tuples to my object very = quickly
    • from time to time, scan the whole online data (or a part of it) and "translate" it into one or more JSON indices (and put them into Elasticsearch)
@Fabian:
You're right that batch processes are not very well suited to work with services that can fail... if the remote call in a map function fails, the whole batch job fails... this should be less problematic with streaming because there's checkpointing, and with async IO it should be possible to add some retry/backoff policies in order to not overload remote services like dbs or solr/es indices (maybe it's not already there, but it should be possible to add). Am I wrong?

@Kostas:

From what I understood, Queryable State is useful for gets... what if I need to scan the entire db? For us it could be better to periodically dump the state to RocksDB or HDFS but, as I already said, I'm not sure if it is safe to start a batch job that reads the dumped data while, in the meantime, a possible update of this dump could happen... is there any potential problem for data consistency (indeed, tuples within grouped objects have references to other objects' keys)?

Best,
Flavio

On Wed, May 17, 2017 at 10:18 AM, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:
Hi Flavio,

For setting the retries, unfortunately there is no such setting yet and, if I am not wrong, in case of a failure of a request, an exception will be thrown and the job will restart. I am also including Till in the thread as he may know better.
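Until such a setting exists, one can implement retries inside the async function itself. A rough sketch under stated assumptions: DatabaseClient and its query() method (returning a CompletableFuture) are made-up placeholders, and the AsyncCollector API shown is roughly the one of Flink 1.2's async I/O.

import java.util.Collections;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

// Retries a failed async lookup with exponential backoff; only after the
// retries are exhausted is the record failed (triggering the restart
// behaviour described above).
public class RetryingEnrichFunction extends RichAsyncFunction<String, String> {

  private static final int MAX_RETRIES = 3;

  private transient DatabaseClient client;          // placeholder async client
  private transient ScheduledExecutorService timer;

  @Override
  public void open(Configuration config) {
    client = new DatabaseClient();
    timer = Executors.newSingleThreadScheduledExecutor();
  }

  @Override
  public void asyncInvoke(String key, AsyncCollector<String> collector) {
    attempt(key, collector, 0);
  }

  private void attempt(String key, AsyncCollector<String> collector, int tries) {
    client.query(key).whenComplete((value, error) -> {
      if (error == null) {
        collector.collect(Collections.singleton(value));
      } else if (tries < MAX_RETRIES) {
        long backoffMs = (1L << tries) * 100;        // 100ms, 200ms, 400ms
        timer.schedule(() -> attempt(key, collector, tries + 1),
            backoffMs, TimeUnit.MILLISECONDS);
      } else {
        collector.collect(error);                    // give up: fail the record
      }
    });
  }
}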

For consistency guarantees and concurrency control, this depends on your underlying backend. But if you want to have end-to-end control, then you could do as Ankit suggested at his point 3), i.e. have a single job for the whole pipeline (if this fits your needs of course). This will allow you to set your own "precedence" rules for your operations.

Now finally, there is currently no way to expose the state of a job to another job. The closest alternatives are either Queryable State, or writing to a Sink. If the problem with having one job is that you emit one element at a time, you can always group elements together and emit downstream less often, in batches.
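A hedged sketch of that batching idea, using a count window; the window size of 100 is an arbitrary placeholder, and MyGroupedObj/addTuple are borrowed from the code further down the thread:

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

// Emits one element per 100 updates of a key instead of one per update.
DataStream<MyGroupedObj> batched = tuples
    .keyBy(0)
    .countWindow(100)
    .apply(new WindowFunction<Tuple4, MyGroupedObj, Tuple, GlobalWindow>() {
      @Override
      public void apply(Tuple key, GlobalWindow window, Iterable<Tuple4> updates,
          Collector<MyGroupedObj> out) {
        MyGroupedObj obj = new MyGroupedObj();
        for (Tuple4 t : updates) {
          obj.addTuple(t);
        }
        out.collect(obj); // one downstream emission per batch
      }
    });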
 
Finally, if you need 2 jobs, you can always use a hybrid solution where you keep your current state in Flink, and you dump it to a queryable Sink, say once per week. The Sink can then be queried at any time, and the data will be at most one week old.

Thanks,
Kostas

On May 17, 2017, at 9:35 AM, Fabian Hueske <fhueske@gmail.com> wrote:

Hi Ankit, just a brief comment on the "batch job is easier than streaming job" argument. I'm not sure about that.
I can see that just the batch job might seem easier to implement, but this is only one part of the whole story. The operational side of using batch is more complex IMO.
You need a tool to ingest your stream, you need storage for the ingested data, you need a periodic scheduler to kick off your batch job, and you need to take care of failures if something goes wrong.
In the streaming case, this is either not needed or the framework does it for you.

Just my 2 cents, Fabian

2017-05-16 20:58 GMT+02:00 Jain, Ankit <ankit.jain@here.com>:

Hi Flavio,

While you wait on an update from Kostas, I wanted to understand the use case better and share my thoughts:

 

1) Why is the current batch mode expensive? Where are you persisting the data after updates? The way I see it, by moving to Flink you get to use RocksDB (a key-value store) that makes your lookups faster – probably right now you are using a non-indexed store like S3?

So, the gain is coming from moving to a better persistence store suited to your use case, rather than from batch -> streaming. Maybe consider just going with a different data store.

IMHO, streaming should only be used if you really want to act on the new events in real time. It is generally harder to get a streaming job correct than a batch one.

 

2) If the current setup is expensive due to serialization/deserialization, then that should be fixed by moving to a faster format (maybe Avro? - I don't have a lot of expertise in that). I don't see how that problem will go away with Flink – you still need to handle serialization.

 

3) Even if you do decide to move to Flink – I think you can do this with one job; two jobs are not needed. At every incoming event, check the previous state and update/output to Kafka or whatever data store you are using.

 

 

Thanks

Ankit

 

From: Flavio Pompermaier <pompermaier@okkam.it>
Date: Tuesday, May 16, 2017 at 9:31 AM
To: Kostas Kloudas <k.kloudas@data-artisans.com>
Cc: user <user@flink.apache.org>
Subject: Re: Stateful streaming question

 

Hi Kostas,

thanks for your quick response.

I also thought about using Async IO; I just need to figure out how to correctly handle parallelism and the number of async requests.

However, that's probably the way to go... Is it also possible to set a number of retry attempts / a backoff when the async request fails (maybe due to a too-busy server)?
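For the parallelism part, a small sketch of how the operator is applied (roughly the Flink 1.2 API; the timeout and capacity values are placeholders, and RetryingEnrichFunction refers to the retry sketch earlier in this thread):

import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;

// 'capacity' caps the number of in-flight async requests per operator
// instance; together with the operator parallelism it bounds the load on
// the external service.
DataStream<String> enriched = AsyncDataStream.unorderedWait(
    keys,                          // DataStream<String> of lookup keys
    new RetryingEnrichFunction(),  // async function with retry/backoff
    10, TimeUnit.SECONDS,          // timeout per request
    100);                          // capacity: max concurrent requests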

 

For the second part I think it's ok to persist the state into RocksDB or HDFS; my question is indeed about that: is it safe to start reading (with another Flink job) from RocksDB or HDFS while there's an updatable state "pending" on it? Should I ensure that state updates are not possible until the other Flink job has finished reading the persisted data?

 

And another question... I've tried to draft such a process and basically I have the following code:

 

DataStream<MyGroupedObj> groupedObj = tuples.keyBy(0)
    .flatMap(new RichFlatMapFunction<Tuple4, MyGroupedObj>() {

      private transient ValueState<MyGroupedObj> state;

      @Override
      public void flatMap(Tuple4 t, Collector<MyGroupedObj> out) throws Exception {
        MyGroupedObj current = state.value();
        if (current == null) {
          current = new MyGroupedObj();
        }
        // ...
        current.addTuple(t);
        // ...
        state.update(current);
        out.collect(current);
      }

      @Override
      public void open(Configuration config) {
        ValueStateDescriptor<MyGroupedObj> descriptor =
            new ValueStateDescriptor<>("test", TypeInformation.of(MyGroupedObj.class));
        state = getRuntimeContext().getState(descriptor);
      }
    });
groupedObj.print();

 

but obviously this way I emit the updated object on every update while, actually, I just want to persist the ValueState somehow (and make it available to another job that runs once a month, for example). Is that possible?

 

 

On Tue, May 16, 2017 at 5:57 PM, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:

Hi Flavio,

 

From what I understand, for the first part you are correct. You can use Flink's internal state to keep your enriched data.

In fact, if you are also querying an external system to enrich your data, it is worth looking at the AsyncIO feature:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html

 

Now for the second part, currently in Flink you cannot iterate over all registered keys for which you have state. A pointer that may be useful to look at is the queryable state:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/queryable_state.html

 

This is still an experimental feature, but let us know your opinion if you use it.
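For illustration, a minimal sketch of exposing keyed state this way, assuming the Flink 1.2 asQueryableState sink variant; note that with a ValueStateDescriptor the queryable state holds the latest element per key, and it serves point lookups only, not scans:

import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.datastream.QueryableStateStream;

// Makes the latest update per key queryable from outside the job under
// the name "latest-update".
QueryableStateStream<Tuple, Tuple4> queryable = tuples
    .keyBy(0)
    .asQueryableState(
        "latest-update",
        new ValueStateDescriptor<>("latest", TypeInformation.of(Tuple4.class)));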

 

Finally, an alternative would be to keep the state in Flink, and periodically flush it to an external storage system, which you can query at will.

 

Thanks,

Kostas

 

 

On May 16, 2017, at 4:38 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:

 

Hi to all,

we're still playing with the Flink streaming part in order to see whether it can improve our current batch pipeline.

At the moment, we have a job that translates incoming data (as Row) into Tuple4, groups them together by the first field and persists the result to disk (using a thrift object). When we need to add tuples to those grouped objects we need to read the persisted data again, flatten it back to Tuple4, union it with the new tuples, re-group by key and finally persist.

 

This is very expensive to do with batch computation, while it should be pretty straightforward to do with streaming (from what I understood): I just need to use ListState. Right?
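A minimal sketch of what the ListState variant could look like; the types reuse the raw Tuple4 from the code earlier in the thread, and when (or whether) to emit downstream is deliberately left out, since that is the open question discussed above:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps the per-key list of tuples in ListState: an update becomes a cheap
// append instead of the batch read-union-regroup-persist cycle.
public class AccumulatingFunction extends RichFlatMapFunction<Tuple4, MyGroupedObj> {

  private transient ListState<Tuple4> tuplesState;

  @Override
  public void open(Configuration config) {
    tuplesState = getRuntimeContext().getListState(new ListStateDescriptor<>(
        "tuples", TypeInformation.of(Tuple4.class)));
  }

  @Override
  public void flatMap(Tuple4 t, Collector<MyGroupedObj> out) throws Exception {
    tuplesState.add(t); // append-only; nothing is emitted per element
  }
}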

Then, let's say I need to scan all the data of the stateful computation (keys and values) in order to do some other computation. I'd like to know:

  • how to do that, i.e. create a DataSet/DataSource<Key, Value> from the stateful data in the stream
  • is there any problem with accessing the stateful data without stopping incoming data (and thus possible updates to the states)?

Thanks in advance for the support,

Flavio

 

 



 

--

Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908




