From: Aljoscha Krettek <aljoscha@apache.org>
Subject: Re: Stateful streaming question
Date: Fri, 16 Jun 2017 11:55:57 +0200
To: Flavio Pompermaier <pompermaier@okkam.it>
Cc: Kostas Kloudas <k.kloudas@data-artisans.com>, Fabian Hüske <fhueske@gmail.com>, "Jain, Ankit" <ankit.jain@here.com>, user <user@flink.apache.org>

I think it might be possible to do, but I'm not aware of anyone working on that, and I haven't seen anyone on the mailing lists express interest in working on it.
On 16. Jun 2017, at 11:31, Flavio Pompermaier <pompermaier@okkam.it> wrote:

Ok, thanks for the clarification. Do you think it could be possible (sooner or later) to have in Flink some sort of synchronization between jobs (as in this case, where the input datastream should be "paused" until the second job finishes)? I know I could use something like Oozie or Falcon to orchestrate jobs, but I'd prefer to avoid adding them to our architecture.

Best,
Flavio

On Fri, Jun 16, 2017 at 11:23 AM, Aljoscha Krettek <aljoscha@apache.org> wrote:
Hi,

I'm afraid not. You would have to wait for one job to finish before starting the next one.

Best,
Aljoscha

On 15. Jun 2017, at 20:11, Flavio Pompermaier <pompermaier@okkam.it> wrote:

Hi Aljoscha,
we're still investigating possible solutions here. Yes, as you correctly said, there are links between data of different keys, so we can only proceed with the next job once we are 100% sure that all input data has been consumed and no other data will be read until this last job ends.
There should be some sort of synchronization between these two jobs... is that possible right now in Flink?
Thanks a lot for the support,
Flavio

On Thu, Jun 15, 2017 at 12:16 PM, Aljoscha Krettek <aljoscha@apache.org> wrote:
Hi,

Trying to revive this somewhat older thread: have you made any progress? I think going with a ProcessFunction that keeps all your state internally and periodically outputs to, say, Elasticsearch using a sink seems like the way to go. You can do the periodic emission using timers in the ProcessFunction.
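A minimal sketch of that timer pattern, assuming the function runs on a keyed stream (MyEvent, MyGroupedObj, and the one-hour interval are placeholders, not something agreed on in this thread):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class PeriodicEmitFunction extends ProcessFunction<MyEvent, MyGroupedObj> {

  private static final long ONE_HOUR = 60 * 60 * 1000L; // placeholder emission interval

  private transient ValueState<MyGroupedObj> state;
  private transient ValueState<Boolean> timerSet;

  @Override
  public void open(Configuration config) {
    state = getRuntimeContext().getState(
        new ValueStateDescriptor<>("grouped", TypeInformation.of(MyGroupedObj.class)));
    timerSet = getRuntimeContext().getState(
        new ValueStateDescriptor<>("timer-set", TypeInformation.of(Boolean.class)));
  }

  @Override
  public void processElement(MyEvent event, Context ctx, Collector<MyGroupedObj> out) throws Exception {
    MyGroupedObj current = state.value();
    if (current == null) {
      current = new MyGroupedObj();
    }
    current.addTuple(event); // update the state, but do not emit here
    state.update(current);

    if (timerSet.value() == null) { // arm one recurring timer per key
      ctx.timerService().registerProcessingTimeTimer(
          ctx.timerService().currentProcessingTime() + ONE_HOUR);
      timerSet.update(true);
    }
  }

  @Override
  public void onTimer(long timestamp, OnTimerContext ctx, Collector<MyGroupedObj> out) throws Exception {
    MyGroupedObj current = state.value();
    if (current != null) {
      out.collect(current); // the periodic snapshot goes to the sink (e.g. Elasticsearch)
    }
    ctx.timerService().registerProcessingTimeTimer(timestamp + ONE_HOUR); // re-arm
  }
}

This way the sink sees one snapshot per key per interval instead of one write per update.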
In your use case, does the data you would store in the Flink managed state have links between data of different keys? This sounds like it could be a problem when it comes to consistency when outputting to an external system.

Best,
Aljoscha
On 17. May 2017, at 14:12, Flavio Pompermaier <pompermaier@okkam.it> wrote:

Hi to all,
there are a lot of useful discussion points :)

I'll try to answer everybody.

@Ankit:
  • right now we're using Parquet on HDFS to store thrift objects. Those objects are essentially structured like
    • key
    • alternative_key
    • list of tuples (representing the state of my Object)
    • This model could potentially be modeled as a Monoid, and it's very well suited to a stateful streaming computation, where updating a single key's state is not as expensive as calling a db to get the current list of tuples and writing the updated list back (IMHO). Maybe here I'm overestimating Flink's streaming capabilities...
  • serialization should be ok using thrift, but Flink advises using tuples for better performance, so just after reading the data from disk (as a ThriftObject) we convert it to its equivalent Tuple3<String, String, List<Tuple4>> representation
  • Since I currently use Flink to ingest data that (in the end) means adding tuples to my objects, it would be perfect to have an "online" state of the grouped tuples in order to:
    • add/remove tuples to my object very quickly
    • from time to time, scan the whole online data (or a part of it) and "translate" it into one or more JSON indices (and put them into Elasticsearch)
@Fabian:
You're right that batch processes are not very well suited to working with services that can fail... if a remote call in a map function fails, the whole batch job fails... this should be less problematic with streaming, because there's checkpointing, and with async I/O it should be possible to add retry/backoff policies so as not to overload remote services like dbs or solr/es indices (maybe that's not there already, but it should be possible to add). Am I wrong?
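As a rough illustration only, such a retry/backoff policy could be layered inside an AsyncFunction along these lines (this assumes the Flink 1.2/1.3-era AsyncCollector API; DbClient and the retry/backoff values are made-up placeholders, not a real client):

import java.util.Collections;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

public class RetryingAsyncEnricher extends RichAsyncFunction<String, Tuple2<String, String>> {

  private static final int MAX_RETRIES = 3;        // placeholder policy
  private static final long BASE_BACKOFF_MS = 100; // placeholder policy

  private transient DbClient client;               // hypothetical async db client
  private transient ScheduledExecutorService retryExecutor;

  @Override
  public void open(Configuration parameters) {
    client = new DbClient();
    retryExecutor = Executors.newSingleThreadScheduledExecutor();
  }

  @Override
  public void asyncInvoke(String key, AsyncCollector<Tuple2<String, String>> collector) {
    attempt(key, collector, 0);
  }

  private void attempt(String key, AsyncCollector<Tuple2<String, String>> collector, int retry) {
    client.lookup(key).whenComplete((value, error) -> { // assumed to return a CompletableFuture<String>
      if (error == null) {
        collector.collect(Collections.singleton(new Tuple2<>(key, value)));
      } else if (retry < MAX_RETRIES) {
        long backoff = BASE_BACKOFF_MS << retry; // 100ms, 200ms, 400ms
        retryExecutor.schedule(() -> attempt(key, collector, retry + 1),
            backoff, TimeUnit.MILLISECONDS);
      } else {
        collector.collect(error); // give up: fail the record, the job will restart
      }
    });
  }
}

It would be wired in with something like AsyncDataStream.unorderedWait(keys, new RetryingAsyncEnricher(), 10000, TimeUnit.MILLISECONDS, 100).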
@Kostas:

From what I understood, Queryable State is useful for gets... what if I need to scan the entire db? For us it could be better to periodically dump the state to RocksDB or HDFS but, as I already said, I'm not sure whether it is safe to start a batch job that reads the dumped data while, in the meantime, a possible update of this dump could happen... is there any potential problem for data consistency (indeed, tuples within grouped objects have references to other objects' keys)?

Best,
Flavio
On Wed, May 17, 2017 at 10:18 AM, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:
Hi Flavio,

For setting the retries, unfortunately there is no such setting yet and, if I am not wrong, in case of a failed request an exception will be thrown and the job will restart. I am also including Till in the thread, as he may know better.

For consistency guarantees and concurrency control, this depends on your underlying backend. But if you want to have end-to-end control, you could do as Ankit suggested in his point 3), i.e. have a single job for the whole pipeline (if this fits your needs, of course). This will allow you to set your own "precedence" rules for your operations.

Now finally, there is currently no way to expose the state of a job to another job. The way to do so is either Queryable State, or writing to a Sink. If the problem with having one job is that you emit one element at a time, you can always group elements together and emit downstream less often, in batches.
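A hedged sketch of that batching idea (the 500-element batch size is a placeholder; note that countWindowAll collapses the stream to parallelism 1, while a keyed countWindow would keep it parallel):

DataStream<List<MyGroupedObj>> batched = updates
    .countWindowAll(500) // placeholder batch size
    .apply(new AllWindowFunction<MyGroupedObj, List<MyGroupedObj>, GlobalWindow>() {
      @Override
      public void apply(GlobalWindow window, Iterable<MyGroupedObj> values,
                        Collector<List<MyGroupedObj>> out) {
        List<MyGroupedObj> batch = new ArrayList<>();
        for (MyGroupedObj v : values) {
          batch.add(v);
        }
        out.collect(batch); // one sink write per batch instead of per element
      }
    });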
Finally, if you need 2 jobs, you can always use a hybrid solution where you keep your current state in Flink and dump it to a queryable Sink, once per week for example. The Sink can then be queried at any time, and the data will be at most one week old.

Thanks,
Kostas
On May 17, 2017, at 9:35 AM, Fabian Hueske <fhueske@gmail.com> wrote:

Hi Ankit, just a brief comment on the "batch job is easier than streaming job" argument. I'm not sure about that.
I can see that just the batch job might seem easier to implement, but this is only one part of the whole story. The operational side of using batch is more complex IMO.
You need a tool to ingest your stream, you need storage for the ingested data, you need a periodic scheduler to kick off your batch job, and you need to take care of failures if something goes wrong.
In the streaming case, this is not needed, or the framework does it for you.

Just my 2 cents, Fabian
2017-05-16 20:58 GMT+02:00 Jain, Ankit <ankit.jain@here.com>:

Hi Flavio,

While you wait on an update from Kostas, I wanted to understand the use case better and share my thoughts-

1) Why is the current batch mode expensive? Where are you persisting the data after updates? The way I see it, by moving to Flink you get to use RocksDB (a key-value store), which makes your lookups faster - probably right now you are using a non-indexed store like S3, maybe? (A sketch of enabling the RocksDB state backend follows below this message.)

So the gain is coming from moving to a persistence store better suited to your use case, rather than from batch->streaming. Maybe consider just going with a different data store.

IMHO, streaming should only be used if you really want to act on new events in real time. It is generally harder to get a streaming job correct than a batch one.

2) If the current setup is expensive due to serialization/deserialization, then that should be fixed by moving to a faster format (maybe Avro? - I don't have a lot of expertise in that). I don't see how that problem will go away with Flink - you still need to handle serialization.

3) Even if you do decide to move to Flink - I think you can do this with one job; two jobs are not needed. At every incoming event, check the previous state and update/output to Kafka or whatever data store you are using.

Thanks
Ankit
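For reference, a hedged sketch of enabling the RocksDB state backend mentioned above (the checkpoint URI, interval, and job name are placeholders; this needs the flink-statebackend-rocksdb dependency):

public static void main(String[] args) throws Exception {
  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  // keyed state lives in RocksDB on local disk and is checkpointed to the given URI
  env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints")); // placeholder URI
  env.enableCheckpointing(60000); // checkpoint every minute (placeholder)
  // ... build the keyed, stateful pipeline here ...
  env.execute("stateful-ingestion"); // placeholder job name
}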
From: Flavio Pompermaier <pompermaier@okkam.it>
Date: Tuesday, May 16, 2017 at 9:31 AM
To: Kostas Kloudas <k.kloudas@data-artisans.com>
Cc: user <user@flink.apache.org>
Subject: Re: Stateful streaming question

Hi Kostas,

thanks for your quick response.

I also thought about using Async IO, I just need to figure out how to correctly handle parallelism and the number of async requests.

However, that's probably the way to go... is it also possible to set a number of retry attempts/backoff for when the async request fails (maybe due to a too busy server)?

For the second part, I think it's ok to persist the state into RocksDB or HDFS; my question is indeed about that: is it safe to start reading (with another Flink job) from RocksDB or HDFS while an updatable state is "pending" on it? Should I ensure that state updates are not possible until the other Flink job has finished reading the persisted data?

And another question... I've tried to draft such a process, and basically I have the following code:
DataStream<MyGroupedObj> groupedObj = tuples.keyBy(0)
    .flatMap(new RichFlatMapFunction<Tuple4, MyGroupedObj>() {

      private transient ValueState<MyGroupedObj> state;

      @Override
      public void flatMap(Tuple4 t, Collector<MyGroupedObj> out) throws Exception {
        MyGroupedObj current = state.value();
        if (current == null) {
          current = new MyGroupedObj();
        }
        ....
        current.addTuple(t);
        ...
        state.update(current);
        out.collect(current);
      }

      @Override
      public void open(Configuration config) {
        ValueStateDescriptor<MyGroupedObj> descriptor =
            new ValueStateDescriptor<>("test", TypeInformation.of(MyGroupedObj.class));
        state = getRuntimeContext().getState(descriptor);
      }
    });

groupedObj.print();
but obviously this way I emit the updated object on every update while, actually, I just want to persist the ValueState somehow (and make it available to another job that runs once/month, for example). Is that possible?
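One hedged workaround, staying with the flatMap above: keep a per-key update counter in state and emit a snapshot only every N updates (the 1000 threshold and the "updates" state name are placeholders):

  // extra keyed state, registered in open() next to `state`:
  //   updateCount = getRuntimeContext().getState(
  //       new ValueStateDescriptor<>("updates", TypeInformation.of(Integer.class)));
  private transient ValueState<Integer> updateCount;

  @Override
  public void flatMap(Tuple4 t, Collector<MyGroupedObj> out) throws Exception {
    MyGroupedObj current = state.value();
    if (current == null) {
      current = new MyGroupedObj();
    }
    current.addTuple(t);
    state.update(current);

    Integer n = updateCount.value();
    n = (n == null) ? 1 : n + 1;
    if (n >= 1000) {        // placeholder threshold
      out.collect(current); // emit a snapshot only occasionally
      n = 0;
    }
    updateCount.update(n);
  }

The ProcessFunction-with-timers pattern sketched earlier in the thread is the time-based variant of the same idea, which fits a "runs once a month" consumer better.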
On Tue, May 16, 2017 at 5:57 PM, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:

Hi Flavio,

From what I understand, for the first part you are correct. You can use Flink's internal state to keep your enriched data.

In fact, if you are also querying an external system to enrich your data, it is worth looking at the AsyncIO feature:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html

Now for the second part, currently in Flink you cannot iterate over all registered keys for which you have state. A pointer that may be useful to look at is queryable state:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/queryable_state.html

This is still an experimental feature, but let us know your opinion if you use it.
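On the job side, exposing keyed state that way is a one-liner; a hedged sketch (the external name "grouped-objects" is a placeholder, and the client-side query API is considerably more involved):

// expose the keyed state of the stream for external point queries
QueryableStateStream<Tuple, MyGroupedObj> queryable = tuples
    .keyBy(0)
    .asQueryableState("grouped-objects"); // placeholder external name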
Finally, an alternative would be to keep the state in Flink and periodically flush it to an external storage system, which you can query at will.

Thanks,
Kostas
On May 16, 2017, at 4:38 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:

Hi to all,

we're still playing with the Flink streaming part in order to see whether it can improve our current batch pipeline.

At the moment, we have a job that translates incoming data (as Row) into Tuple4, groups the tuples by the first field and persists the result to disk (using a thrift object). When we need to add tuples to those grouped objects, we need to read the persisted data again, flatten it back to Tuple4, union it with the new tuples, re-group by key and finally persist again.
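In DataSet API terms that round-trip presumably looks something like the following hedged sketch (MyThriftObj, readPersisted, flattenToTuples, rebuildGroupedObject, and MyThriftOutputFormat are placeholder names for illustration):

DataSet<Tuple4<String, String, String, String>> oldTuples = readPersisted(env)
    .flatMap(new FlatMapFunction<MyThriftObj, Tuple4<String, String, String, String>>() {
      @Override
      public void flatMap(MyThriftObj obj, Collector<Tuple4<String, String, String, String>> out) {
        for (Tuple4<String, String, String, String> t : flattenToTuples(obj)) {
          out.collect(t); // flatten each persisted grouped object back into tuples
        }
      }
    });

DataSet<MyThriftObj> regrouped = oldTuples
    .union(newTuples) // union with the freshly ingested tuples
    .groupBy(0)       // re-group everything by key
    .reduceGroup(new GroupReduceFunction<Tuple4<String, String, String, String>, MyThriftObj>() {
      @Override
      public void reduce(Iterable<Tuple4<String, String, String, String>> values,
                         Collector<MyThriftObj> out) {
        out.collect(rebuildGroupedObject(values)); // rebuild one grouped object per key
      }
    });

regrouped.output(new MyThriftOutputFormat()); // persist everything again (placeholder format)

Every incremental update pays for a full read-flatten-union-regroup-write cycle, which is exactly the cost in question.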
This is very expensive to do with batch computation, while it should be pretty straightforward to do with streaming (from what I understood): I just need to use ListState. Right?
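A hedged fragment of that ListState idea (the state name, the tuple types, and the String output type are placeholders):

public class CollectTuples
    extends RichFlatMapFunction<Tuple4<String, String, String, String>, String> {

  private transient ListState<Tuple4<String, String, String, String>> tuplesState;

  @Override
  public void open(Configuration config) {
    tuplesState = getRuntimeContext().getListState(
        new ListStateDescriptor<>("tuples", // placeholder state name
            TypeInformation.of(new TypeHint<Tuple4<String, String, String, String>>() {})));
  }

  @Override
  public void flatMap(Tuple4<String, String, String, String> t, Collector<String> out)
      throws Exception {
    tuplesState.add(t); // appending to the key's list is cheap, no external read-modify-write
  }
}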
Then, let's say I need to scan all the data of the stateful computation (keys and values) in order to do some other computation. I'd like to know:

  • how to do that? I.e. create a DataSet/DataSource<Key, Value> from the stateful data in the stream
  • is there any problem with accessing the stateful data without stopping incoming data (and thus possible updates to the states)?
Thanks in advance for the support,
Flavio

--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908