flink-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: In-Memory Lookup in Flink Operators
Date Sat, 29 Sep 2018 23:47:35 GMT
Hi Lasse,

One approach I’ve used in a similar situation is to have a “UnionedSource” wrapper that
first emits the (bounded) data that will be loaded in-memory, and then starts running the
source that emits the continuous stream of data.

This outputs an Either<A, B>, which I then split, and broadcast the A, and key/partition
the B.

You could do something similar, but occasionally keep checking if there’s more <A>
data vs. assuming it’s bounded.
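The union-then-stream pattern Ken describes could be sketched as follows. This is a minimal, Flink-free illustration of the idea (drain the bounded reference data A first, then switch to the live stream B, tagging each record so a downstream split can broadcast A and key/partition B); the class and method names are illustrative and are not Flink API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class UnionedSourceSketch {

    // Poor-man's Either: exactly one of left (A) / right (B) is set.
    public static final class Either<A, B> {
        public final A left;
        public final B right;
        private Either(A a, B b) { left = a; right = b; }
        public static <A, B> Either<A, B> left(A a)  { return new Either<>(a, null); }
        public static <A, B> Either<A, B> right(B b) { return new Either<>(null, b); }
        public boolean isLeft() { return left != null; }
    }

    // Emit every record of the bounded A source first, then the B stream
    // (finite here only so the sketch terminates).
    public static <A, B> List<Either<A, B>> run(Iterator<A> bounded, Iterator<B> stream) {
        List<Either<A, B>> out = new ArrayList<>();
        while (bounded.hasNext()) out.add(Either.left(bounded.next()));
        while (stream.hasNext())  out.add(Either.right(stream.next()));
        return out;
    }
}
```

In a real job the downstream operator would broadcast the `left` records and key/partition the `right` ones.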

The main issue I ran into is that it doesn’t seem possible to do checkpointing, or at least
I couldn’t think of a way to make this work properly.

— Ken

> On Sep 27, 2018, at 9:50 PM, Lasse Nedergaard <lassenedergaard@gmail.com> wrote:
> Hi. 
> We have created our own database source that polls the data at a configured interval.
We then use a CoProcessFunction. It takes two inputs: one from our database and one from
our data stream. It requires that you keyBy the attributes you use for lookup in your map function.

> Delaying your data input until the database lookup has completed for the first time is
not simple, but one solution could be to implement a delay operation, or to keep the data
in your process function until data arrives from your database stream.
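The buffering idea Lasse mentions could be sketched like this: a two-input operator that holds stream elements until the lookup side has delivered its first rows, then flushes. This is an illustrative, Flink-free sketch, not Flink's actual CoProcessFunction API; all names are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BufferingJoinSketch {
    private final Map<String, String> lookup = new HashMap<>();
    private final Deque<String> buffered = new ArrayDeque<>();
    private final List<String> emitted = new ArrayList<>();

    // Input 1: lookup rows polled from the database source.
    public void onLookupRow(String key, String value) {
        lookup.put(key, value);
        // Lookup data has arrived: flush anything buffered so far.
        while (!buffered.isEmpty()) process(buffered.poll());
    }

    // Input 2: the data stream, keyed by the lookup attribute.
    public void onElement(String key) {
        if (lookup.isEmpty()) buffered.add(key);  // hold until lookup is ready
        else process(key);
    }

    private void process(String key) {
        emitted.add(key + "=" + lookup.getOrDefault(key, "?"));
    }

    public List<String> output() { return emitted; }
}
```

In a real CoProcessFunction the buffer would live in keyed state so it survives failures.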
> Best regards
> Lasse Nedergaard
> On 28 Sep 2018 at 06:28, Chirag Dewan <chirag.dewan22@yahoo.in> wrote:
>> Hi,
>> I saw the thread "static/dynamic lookups in flink streaming" being discussed on the
Apache Flink User Mailing List archive, and then I saw this FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API

>> I know we haven't made much progress on this topic, but I still wanted to put forward
my problem statement around this.
>> I am also looking for a dynamic lookup in Flink operators. I actually want to pre-fetch
various data sources, like a DB, filesystem, Cassandra etc., into memory. Along with that,
I have to ensure the in-memory lookup table is refreshed periodically, with a configurable period.
>> This is what a map operator would look like with lookup: 
>> -> Load in-memory lookup - Refresh timer start
>> -> Stream processing start
>> -> Call lookup
>> -> Use lookup result in Stream processing
>> -> Timer elapsed -> Reload lookup data source into in-memory table
>> -> Continue processing
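The lifecycle above could be sketched as follows. The loader and clock are injected so the sketch stays self-contained and testable; this is an assumption-laden illustration (the loader would really be a DB/Cassandra fetch, and in Flink the reload would typically be driven by a processing-time timer), not actual Flink API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

public class RefreshingLookup {
    private final Supplier<Map<String, String>> loader;  // stands in for the DB/Cassandra fetch
    private final LongSupplier clock;                    // millisecond time source
    private final long refreshMillis;                    // the configurable period
    private Map<String, String> table = new HashMap<>();
    private long lastRefresh;

    public RefreshingLookup(Supplier<Map<String, String>> loader,
                            LongSupplier clock, long refreshMillis) {
        this.loader = loader;
        this.clock = clock;
        this.refreshMillis = refreshMillis;
        reload();  // "Load in-memory lookup - Refresh timer start"
    }

    private void reload() {
        table = loader.get();
        lastRefresh = clock.getAsLong();
    }

    // Called per stream element, like a map() with lookup.
    public String map(String key) {
        // "Timer elapsed -> Reload lookup data source into in-memory table"
        if (clock.getAsLong() - lastRefresh >= refreshMillis) reload();
        return table.getOrDefault(key, "MISS");
    }
}
```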
>> My concerns around this are:
>> 1) Possibly storing the same copy of the data in every task slot's memory or state backend
(RocksDB in my case).
>> 2) Having a dedicated refresh thread for each subtask instance (so every Task Manager
may have multiple refresh threads).
>> Am I thinking in the right direction, or missing something very obvious? It's confusing.
>> Any leads are much appreciated. Thanks in advance.
>> Cheers, 
>> Chirag

Ken Krugler
+1 530-210-6378
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
