flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Smith <java.dev....@gmail.com>
Subject Re: Best way to link static data to event data?
Date Mon, 30 Sep 2019 17:35:53 GMT
Ok thanks. It's basically telephone area codes, they barely ever change.

On Mon, 30 Sep 2019 at 06:03, Gaël Renoux <gael.renoux@datadome.co> wrote:

> Hi John,
> I've had a similar requirement, and I've resorted to simply use a static
> cache (I'm coding in Scala, so that's a lazy value on a singleton object -
> in Java that would be a static value on some utility class, with a
> synchronized lazy-loading getter). The value is reloaded after some
> duration, which adds a small latency at regular intervals. Keep in mind
> that one instance of that value will be loaded on each task manager
> (provided that at least one task running on that task manager calls the
> getter).
> If you're OK with restarting the job when your data changes, it would be
> better to load it on start (no need to synchronize stuff). Just load it
> inside your job initialization code (it will be executed within the job
> manager) and pass that data as a parameter to your operator's constructor.
> The data format must be serializable.
> Gaël
> On Sat, Sep 28, 2019 at 2:18 AM Sameer Wadkar <sameer@axiomine.com> wrote:
>> The main consideration in these type of scenarios is not the type of
>> source function you use. The key point is how does the event operator get
>> the slow moving master data and cache it. And then recover it if it fails
>> and restarts again.
>> It does not matter that the csv file does not change often. It is
>> possible that the event operator may fail and restart. The csv data needs
>> to made available to it again.
>> In that scenario the initial suggestion I made to pass the csv data in
>> the constructor is not adequate by itself. You need to store it in the
>> operator state which allows it to recover it when it restarts  on failure.
>> As long as the above takes place you have resiliency and you can use any
>> suitable method or source. I have not used Table source as much but
>> connected streams and operator state has worked out for me in similar
>> scenarios.
>> Sameer
>> Sent from my iPhone
>> On Sep 27, 2019, at 4:38 PM, John Smith <java.dev.mtl@gmail.com> wrote:
>> It's a fairly small static file that may update once in a blue moon lol
>> But I'm hopping to use existing functions. Why can't I just use CSV to
>> table source?
>> Why should I have to now either write my own CSV parser or look for 3rd
>> party, then what put in a Java Map and lookup that map? I'm finding Flink
>> to be a bit of death by 1000 paper cuts lol
>> if i put the CSV in a table I can then use it to join across it with the
>> event no?
>> On Fri, 27 Sep 2019 at 16:25, Sameer W <sameer@axiomine.com> wrote:
>>> Connected Streams is one option. But may be an overkill in your scenario
>>> if your CSV does not refresh. If your CSV is small enough (number of
>>> records wise), you could parse it and load it into an object (serializable)
>>> and pass it to the constructor of the operator where you will be streaming
>>> the data.
>>> If the CSV can be made available via a shared network folder (or S3 in
>>> case of AWS) you could also read it in the open function (if you use Rich
>>> versions of the operator).
>>> The real problem I guess is how frequently does the CSV update. If you
>>> want the updates to propagate in near real time (or on schedule) the option
>>> 1  ( parse in driver and send it via constructor does not work). Also in
>>> the second option you need to be responsible for refreshing the file read
>>> from the shared folder.
>>> In that case use Connected Streams where the stream reading in the file
>>> (the other stream reads the events) periodically re-reads the file and
>>> sends it down the stream. The refresh interval is your tolerance of stale
>>> data in the CSV.
>>> On Fri, Sep 27, 2019 at 3:49 PM John Smith <java.dev.mtl@gmail.com>
>>> wrote:
>>>> I don't think I need state for this...
>>>> I need to load a CSV. I'm guessing as a table and then filter my events
>>>> parse the number, transform the event into geolocation data and sink that
>>>> downstream data source.
>>>> So I'm guessing i need a CSV source and my Kafka source and somehow
>>>> join those transform the event...
>>>> On Fri, 27 Sep 2019 at 14:43, Oytun Tez <oytun@motaword.com> wrote:
>>>>> Hi,
>>>>> You should look broadcast state pattern in Flink docs.
>>>>> ---
>>>>> Oytun Tez
>>>>> *M O T A W O R D*
>>>>> The World's Fastest Human Translation Platform.
>>>>> oytun@motaword.com — www.motaword.com
>>>>> On Fri, Sep 27, 2019 at 2:42 PM John Smith <java.dev.mtl@gmail.com>
>>>>> wrote:
>>>>>> Using 1.8
>>>>>> I have a list of phone area codes, cities and their geo location
>>>>>> CSV file. And my events from Kafka contain phone numbers.
>>>>>> I want to parse the phone number get it's area code and then
>>>>>> associate the phone number to a city, geo location and as well count
>>>>>> many numbers are in that city/geo location.
> --
> Gaël Renoux
> Senior R&D Engineer, DataDome
> M +33 6 76 89 16 52  <+33+6+76+89+16+52>
> E gael.renoux@datadome.co  <gael.renoux@datadome.co>
> W www.datadome.co
> <http://www.datadome.co?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>
> <https://www.facebook.com/datadome/?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>
> <https://fr.linkedin.com/company/datadome?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>
> <https://twitter.com/data_dome?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>
> [image: Read DataDome reviews on G2]
> <https://www.g2.com/products/datadome/reviews?utm_source=review-widget&utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>

View raw message