flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Philippe CAPARROY <philippe.capar...@orange.fr>
Subject re: Elegantly sharing state in a streaming environment
Date Mon, 30 May 2016 12:31:55 GMT

Just transform the list in a DataStream. A datastream can be finite.


One solution, in the context of a Streaming environment is to use Kafka, or any other distributed
broker, although Flink ships with a KafkaSource.

 

1)Create a Kafka Topic dedicated to your list of key/values. Inject your values into this
topic, partitionned by the keys. So that you recover the keys in Flink.

 

2) Create a source for the stream of tuple your analysing -> output1 (Tuples).

 

3) Create a KafkaSource, and parse/recover your key value pairs from this source (e.g a first
map operator) : map1 -> output 2 (K,V), then :

 

 

 

                 a)  If you need all key/Value pairs at each operator : 
broadcast all partitions from the output 1 to the analysis operator

 

                  b) if you dont need all key/values pairs, just chain output1
to the analysis operator. Partitioning of K,V pairs will depend on Kafka partitioning strategy,
and can be controlled in Flink      anyway.

 

4) The analysis operator :  will perform a RichCoFlatMapFunction, and can be Checkpointed.

When receiving K,V pairs from output2, store them in a local state.

When receiving tuple, should be able to to filter with the help of the local state, and propagate
downstream or not.

 

 

 

 

 

 

 

 

 

 

 

> Message du 30/05/16 13:41
> De : leon_mclare@tutanota.com
> A : "User" 
> Copie à : 
> Objet : Elegantly sharing state in a streaming environment
> 
>Hello Flink team,

How can i partition and share static state among instances of a streaming operator? 

I have a huge list of keys and values, which are used to filter tuples in a stream. The list
does not change. Currently i am sharing the list with each operator instance via the constructor,
although only a subset of the list is required per operator (the assignment of subset to operator
instance is known). I cannot use DataSet based functions in a streaming execution environment
to assign sub lists. I also cannot use DataStream based partitioning functions as the list
is static, i.e. not a DataStream. The dilemma exists as i am mixing static (DataSet type)
content with streaming content. Is there any other approach aside from using an additional
tool (e.g. distributed cache)?

Thanks in advance.

Regards
Leon




Mime
View raw message