flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Lappo <pe...@systematicmethods.com>
Subject ETL with changing reference data
Date Sun, 10 Sep 2017 21:59:59 GMT
We are building an ETL style application in Flink that consumes records from a file or a message
bus as a DataStream. We are transforming records using SQL and UDFs. The UDF loads reference
data in the open method and currently the data loaded remains in memory until the job is cancelled.
The eval method of the UDF is used to do the actual transformation on a particular field.
So of course reference data changes and data will need to reprocessed. Lets assume we can
identify and resubmit records for reprocessing what is the best design that
* keeps the Flink job running
* reloads the changed reference data
so that records are reprocessed in a deterministic fashion

Two options spring to mind
1) send a control record to the stream that reloads reference data or part of it and ensure
resubmitted records are processed after the reload message
2) use a separate thread to poll the reference data source and reload any changed data which
will of course suffer from race conditions

Or is there a better way of solving this type of problem with Flink?

View raw message