flink-user mailing list archives

From Kostas Kloudas <k.klou...@data-artisans.com>
Subject Re: Stateful streaming question
Date Tue, 16 May 2017 15:57:55 GMT
Hi Flavio,

From what I understand, for the first part you are correct. You can use Flink's internal
state to keep your enriched data.
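For example, a minimal sketch of that with a ListState per key could look like the following
(the class name, the state name, and the Tuple4<String, String, String, String> element type
are just placeholders for your own data model):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps all tuples seen so far for each key in Flink's managed keyed state,
// so previously persisted data never has to be read back when new tuples arrive.
public class GroupingStateFunction extends
        RichFlatMapFunction<Tuple4<String, String, String, String>,
                            Tuple4<String, String, String, String>> {

    private transient ListState<Tuple4<String, String, String, String>> grouped;

    @Override
    public void open(Configuration parameters) {
        ListStateDescriptor<Tuple4<String, String, String, String>> descriptor =
                new ListStateDescriptor<>(
                        "grouped-tuples",
                        TypeInformation.of(new TypeHint<Tuple4<String, String, String, String>>() {}));
        grouped = getRuntimeContext().getListState(descriptor);
    }

    @Override
    public void flatMap(Tuple4<String, String, String, String> value,
                        Collector<Tuple4<String, String, String, String>> out) throws Exception {
        grouped.add(value);   // append the new tuple to this key's list state
        out.collect(value);   // forward it (or emit the whole group, as you prefer)
    }
}

Applied after a keyBy on the grouping field, e.g. stream.keyBy(0).flatMap(new GroupingStateFunction()),
every key keeps its own list, and Flink's checkpoints take care of fault tolerance.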
In fact, if you are also querying an external system to enrich your data, it is worth looking
at the AsyncIO feature:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html
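
Roughly, the usage looks like the sketch below (DatabaseClient and its query() method are
hypothetical placeholders for whichever external system you query; the AsyncCollector callback
type is the one from the 1.2 API and was renamed in later releases):

import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

// Looks up enrichment data for each key in an external system without blocking the task thread.
public class AsyncEnrichment extends RichAsyncFunction<String, Tuple2<String, String>> {

    private transient DatabaseClient client;   // hypothetical non-blocking client

    @Override
    public void open(Configuration parameters) {
        client = new DatabaseClient();         // connect to your external store here
    }

    @Override
    public void close() {
        client.close();
    }

    @Override
    public void asyncInvoke(String key, AsyncCollector<Tuple2<String, String>> collector) {
        CompletableFuture<String> result = client.query(key);    // non-blocking request
        result.thenAccept(value ->
                collector.collect(Collections.singleton(new Tuple2<>(key, value))));
    }
}

It is then wired into the pipeline with something like
AsyncDataStream.unorderedWait(input, new AsyncEnrichment(), 1000, TimeUnit.MILLISECONDS, 100),
which also caps the number of in-flight requests.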

Now for the second part: currently in Flink you cannot iterate over all registered keys for
which you have state. A pointer that may be useful to look at is queryable state:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/queryable_state.html

This is still an experimental feature, but let us know your opinion if you use it.
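
For example, the ListStateDescriptor from the first sketch above could be exposed for external
queries just by marking it as queryable before acquiring the state (the query name is freely
chosen); reads then go through the QueryableStateClient described in the docs:

// Inside open(), before getRuntimeContext().getListState(descriptor):
descriptor.setQueryable("grouped-tuples-query");   // state becomes reachable under this name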

Finally, an alternative would be to keep state in Flink, and periodically flush it to an external
storage system, which you can query at will.
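
A sketch of that idea, assuming processing-time timers are good enough for your use case:
a ProcessFunction applied after keyBy buffers the group, a timer periodically re-emits the
accumulated tuples, and whatever sink you attach downstream (JDBC, Elasticsearch, Cassandra, ...)
persists the snapshot. The interval and element type below are just examples.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Buffers tuples per key and periodically re-emits the whole group so that a
// downstream sink can persist a fresh snapshot in an external, queryable store.
public class PeriodicFlushFunction
        extends ProcessFunction<Tuple2<String, String>, Tuple2<String, String>> {

    private static final long FLUSH_INTERVAL_MS = 10 * 60 * 1000;

    private transient ListState<Tuple2<String, String>> buffered;

    @Override
    public void open(Configuration parameters) {
        buffered = getRuntimeContext().getListState(
                new ListStateDescriptor<>(
                        "buffered-tuples",
                        TypeInformation.of(new TypeHint<Tuple2<String, String>>() {})));
    }

    @Override
    public void processElement(Tuple2<String, String> element,
                               Context ctx,
                               Collector<Tuple2<String, String>> out) throws Exception {
        buffered.add(element);
        // Simplification: this registers a timer per element; a real job would
        // remember (e.g. in a ValueState) whether a timer is already pending.
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + FLUSH_INTERVAL_MS);
    }

    @Override
    public void onTimer(long timestamp,
                        OnTimerContext ctx,
                        Collector<Tuple2<String, String>> out) throws Exception {
        // Flush the current snapshot for this key to the downstream sink.
        for (Tuple2<String, String> t : buffered.get()) {
            out.collect(t);
        }
    }
}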

Thanks,
Kostas


> On May 16, 2017, at 4:38 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> 
> Hi to all,
> we're still playing with the Flink streaming part in order to see whether it can improve our current batch pipeline.
> At the moment, we have a job that translates incoming data (as Row) into Tuple4, groups them together by the first field and persists the result to disk (using a thrift object). When we need to add tuples to those grouped objects we need to read the persisted data again, flatten it back to Tuple4, union it with the new tuples, re-group by key and finally persist.
> 
> This is very expensive to do with batch computation, while it should be pretty straightforward to do with streaming (from what I understood): I just need to use ListState. Right?
> Then, let's say I need to scan all the data of the stateful computation (keys and values) in order to do some other computation. I'd like to know:
> how to do that? I.e. create a DataSet/DataSource<Key,Value> from the stateful data
in the stream
> is there any problem with accessing the stateful data without stopping incoming data (and thus possible updates to the state)?
> Thanks in advance for the support,
> Flavio
> 

