flink-user mailing list archives

From "Tzu-Li (Gordon) Tai" <tzuli...@apache.org>
Subject Re: Best practices to maintain reference data for Flink Jobs
Date Fri, 19 May 2017 05:43:12 GMT

Can the enriching data be keyed? Or is it something that has to be broadcast to each operator?
Either way, I think Side Inputs (an upcoming feature) is the best fit for this.
You can take a look at https://issues.apache.org/jira/browse/FLINK-6131.

Regarding the 3 options you listed:

By using QueryableState in option B, what you mean is that you want to feed the enriching
data stream to a separate job, have that job expose queryable state, and query that state from
the actual application job's operators, correct? If so, I think options A and B would mean the
same thing; i.e., they both require accessing data external to the job.

If the enriching data can somehow be keyed with the stream that requires it, I would go for
option C using connected streams, with the enriching data as one input and the actual data
as the other. Instead of just “caching the enriching data in memory”, you should register
it as managed Flink state for the CoMapFunction / CoFlatMapFunction. The actual input stream
records can just access that registered state locally.
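
To make the idea concrete, here is a rough plain-Java model of that connected-streams pattern. It is only a sketch and does not use the Flink API: the class and method names (EnrichmentSketch, Enricher) are made up for illustration, the two methods merely mirror the flatMap1 / flatMap2 callbacks of a CoFlatMapFunction, and the HashMap stands in for keyed state. In a real job, both streams would be keyed by the same key before connecting them, and the map would be replaced by keyed state (e.g. a ValueState registered through the runtime context in a RichCoFlatMapFunction) so that Flink checkpoints and restores it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EnrichmentSketch {

    /** Plain-Java stand-in for a CoFlatMapFunction over two streams that share a key. */
    static class Enricher {
        // Assumption: this map plays the role of keyed state. In Flink it would be
        // managed state scoped to the current key, not a field on the function.
        private final Map<String, String> referenceByKey = new HashMap<>();

        /** Input 1: a reference-data update. Remember the latest value for its key. */
        void flatMap1(String key, String referenceValue, List<String> out) {
            referenceByKey.put(key, referenceValue);
        }

        /** Input 2: an actual data record. Enrich it from the locally held reference value. */
        void flatMap2(String key, String event, List<String> out) {
            String ref = referenceByKey.get(key);
            // The reference value may not have arrived yet; emit a marker in that case.
            out.add(key + ":" + event + ":" + (ref == null ? "<no-reference-yet>" : ref));
        }
    }

    public static void main(String[] args) {
        Enricher enricher = new Enricher();
        List<String> out = new ArrayList<>();

        enricher.flatMap2("user-1", "click", out);   // arrives before any reference data
        enricher.flatMap1("user-1", "gold", out);    // reference data for user-1
        enricher.flatMap2("user-1", "purchase", out);
        enricher.flatMap1("user-1", "platinum", out); // refresh replaces the old value
        enricher.flatMap2("user-1", "logout", out);

        out.forEach(System.out::println);
    }
}
```

Note that a record can arrive before its reference value does, so the record-side callback has to decide what to do in that case (emit as-is, drop, or buffer the record in state until the reference value shows up).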


On 19 May 2017 at 7:11:07 AM, Sand Stone (sand.m.stone@gmail.com) wrote:

Hi. Say I have a few reference data sets that need to be used for a  
streaming job. The sizes range between 10M-10GB. The data is not  
static; it will be refreshed at minute and/or day intervals.  

With the new advancements in Flink, it seems there are quite a few options.  
A. Store all the data in an external (kv) database cluster. And use  
async io calls  
* data refresh can be done in a few different ways  
B. Use the new Queryable State feature  
* it seems there is no "easy" API to discover the  
queryable state at the moment. Need to use the restful API to figure  
out the job id.  
C. Ingest the reference data into the job and cache them in memory  
Any other option?  

On paper, it seems option B with the Queryable State is the cleanest solution.  

Any comments/suggestions are greatly appreciated, in particular in terms  
of robustness and consistent recovery.  

Thanks much!  
