flink-user mailing list archives

From Gaël Renoux <gael.ren...@datadome.co>
Subject Re: Best way to link static data to event data?
Date Mon, 30 Sep 2019 10:03:10 GMT
Hi John,

I've had a similar requirement, and I've resorted to simply using a static
cache (I'm coding in Scala, so that's a lazy value on a singleton object -
in Java that would be a static value on some utility class, with a
synchronized lazy-loading getter). The value is reloaded after some
duration, which adds a small latency at regular intervals. Keep in mind
that one instance of that value will be loaded on each task manager
(provided that at least one task running on that task manager calls the

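A minimal sketch of the Java variant mentioned above (a static value on a utility class with a synchronized lazy-loading getter that reloads after some duration). The class name, the refresh interval and the sample data are assumptions made for illustration; the loader would be replaced by code that reads the real file.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Static, time-refreshed cache. One copy exists per JVM, so one per task manager. */
final class AreaCodeCache {
    // Hypothetical refresh interval.
    private static final long TTL_NANOS = TimeUnit.MINUTES.toNanos(10);

    private static Map<String, String> data;   // null until the first call
    private static long loadedAtNanos;

    private AreaCodeCache() {}

    /** Synchronized lazy-loading getter: loads on first call, reloads once the TTL has elapsed. */
    static synchronized Map<String, String> get() {
        long now = System.nanoTime();
        if (data == null || now - loadedAtNanos >= TTL_NANOS) {
            data = load();          // the call that triggers a reload pays the latency
            loadedAtNanos = now;
        }
        return data;
    }

    /** Hypothetical loader with hard-coded sample data standing in for the real source. */
    private static Map<String, String> load() {
        Map<String, String> m = new HashMap<>();
        m.put("514", "Montreal");
        m.put("212", "New York");
        return Collections.unmodifiableMap(m);
    }
}
```

Any function running on the task manager can then call `AreaCodeCache.get()` for each event; the synchronized getter keeps concurrent tasks in the same JVM from loading the data twice.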
If you're OK with restarting the job when your data changes, it would be
better to load it on start (no need to synchronize stuff). Just load it
inside your job initialization code (it will be executed within the job
manager) and pass that data as a parameter to your operator's constructor.
The data format must be serializable.
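A rough sketch of that second approach in plain Java, to show why the data has to be serializable. `CityLookup` and `CityLookupDemo` are hypothetical names; in a real job the class would also implement a Flink function interface such as `MapFunction`, which is left out here so the example has no dependencies. The round trip through Java serialization only imitates what happens when the function object is shipped to the workers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical operator-like class that receives the static data through its constructor. */
final class CityLookup implements Serializable {
    private static final long serialVersionUID = 1L;

    // HashMap is Serializable, so the data travels with the object.
    private final HashMap<String, String> cityByAreaCode;

    CityLookup(Map<String, String> cityByAreaCode) {
        this.cityByAreaCode = new HashMap<>(cityByAreaCode);
    }

    /** Assumes the number starts with a 3-digit area code; returns "unknown" otherwise. */
    String map(String phoneNumber) {
        if (phoneNumber == null || phoneNumber.length() < 3) {
            return "unknown";
        }
        return cityByAreaCode.getOrDefault(phoneNumber.substring(0, 3), "unknown");
    }
}

final class CityLookupDemo {
    private CityLookupDemo() {}

    /** Serializes and deserializes the object, as a stand-in for shipping it to a worker. */
    static CityLookup roundTrip(CityLookup op) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(op);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CityLookup) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("data passed to the constructor is not serializable", e);
        }
    }
}
```

If the map held a non-serializable value type, `roundTrip` would fail, which is the same kind of error you would hit at job submission.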


On Sat, Sep 28, 2019 at 2:18 AM Sameer Wadkar <sameer@axiomine.com> wrote:

> The main consideration in this type of scenario is not the type of
> source function you use. The key point is how the event operator gets
> the slow-moving master data and caches it, and then recovers it if it fails
> and restarts.
> It does not matter that the csv file does not change often. It is possible
> that the event operator may fail and restart. The csv data needs to be made
> available to it again.
> In that scenario the initial suggestion I made to pass the csv data in the
> constructor is not adequate by itself. You need to store it in the operator
> state, which allows it to recover the data when it restarts on failure.
> As long as the above takes place you have resiliency and you can use any
> suitable method or source. I have not used Table source as much, but
> connected streams and operator state have worked out for me in similar
> scenarios.
> Sameer
> Sent from my iPhone
> On Sep 27, 2019, at 4:38 PM, John Smith <java.dev.mtl@gmail.com> wrote:
> It's a fairly small static file that may update once in a blue moon lol
> But I'm hoping to use existing functions. Why can't I just use CSV to
> table source?
> Why should I have to now either write my own CSV parser or look for a 3rd
> party one, then what, put it in a Java Map and look up that map? I'm finding Flink
> to be a bit of death by 1000 paper cuts lol
> If I put the CSV in a table I can then use it to join across it with the
> event, no?
> On Fri, 27 Sep 2019 at 16:25, Sameer W <sameer@axiomine.com> wrote:
>> Connected Streams is one option, but it may be overkill in your scenario
>> if your CSV does not refresh. If your CSV is small enough (number of
>> records wise), you could parse it and load it into an object (serializable)
>> and pass it to the constructor of the operator where you will be streaming
>> the data.
>> If the CSV can be made available via a shared network folder (or S3 in
>> case of AWS) you could also read it in the open function (if you use Rich
>> versions of the operator).
>> The real problem I guess is how frequently the CSV updates. If you
>> want the updates to propagate in near real time (or on a schedule), option
>> 1 (parse in the driver and send it via the constructor) does not work. Also, in
>> the second option you are responsible for refreshing the file read
>> from the shared folder.
>> In that case use Connected Streams where the stream reading in the file
>> (the other stream reads the events) periodically re-reads the file and
>> sends it down the stream. The refresh interval is your tolerance of stale
>> data in the CSV.
>> On Fri, Sep 27, 2019 at 3:49 PM John Smith <java.dev.mtl@gmail.com>
>> wrote:
>>> I don't think I need state for this...
>>> I need to load a CSV, I'm guessing as a table, and then filter my events,
>>> parse the number, transform the event into geolocation data and sink that
>>> to a downstream data source.
>>> So I'm guessing I need a CSV source and my Kafka source and somehow join
>>> those to transform the event...
>>> On Fri, 27 Sep 2019 at 14:43, Oytun Tez <oytun@motaword.com> wrote:
>>>> Hi,
>>>> You should look at the broadcast state pattern in the Flink docs.
>>>> ---
>>>> Oytun Tez
>>>> *M O T A W O R D*
>>>> The World's Fastest Human Translation Platform.
>>>> oytun@motaword.com — www.motaword.com
>>>> On Fri, Sep 27, 2019 at 2:42 PM John Smith <java.dev.mtl@gmail.com>
>>>> wrote:
>>>>> Using 1.8
>>>>> I have a list of phone area codes, cities and their geo locations in a
>>>>> CSV file. And my events from Kafka contain phone numbers.
>>>>> I want to parse the phone number, get its area code and then associate
>>>>> the phone number with a city and geo location, and also count how many numbers
>>>>> are in that city/geo location.

Gaël Renoux
Senior R&D Engineer, DataDome
M +33 6 76 89 16 52
E gael.renoux@datadome.co
W www.datadome.co
