beam-user mailing list archives

From Lukasz Cwik <lc...@google.com>
Subject Re: Best way to inject external per key config into the pipeline
Date Thu, 05 Oct 2017 17:58:53 GMT
Yes, every time you start the pipeline you need to read in all the config
data. You can do this by flattening a bounded source that reads the
current state of the config data with a streaming source that receives
updates, and then using the result as a side input. On the other hand,
with a few million keys, an in-memory cache with your own refresh policy
that contacts your datastore would also likely work well.
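To make the second suggestion concrete, here is a minimal sketch (in plain Python, outside any Beam SDK) of an in-memory cache with a TTL-based refresh policy. The `fetch_config` callable stands in for the real RESTful lookup against the external store and is purely illustrative; the TTL value and the structure of the config are assumptions, not anything from the thread.

```python
import time

class ConfigCache:
    """Per-key config cache that refreshes stale entries from an
    external store. fetch_config is a stand-in for the RESTful call."""

    def __init__(self, fetch_config, ttl_seconds=600, clock=time.monotonic):
        self._fetch = fetch_config
        self._ttl = ttl_seconds
        self._clock = clock          # injectable, which makes testing easy
        self._entries = {}           # key -> (fetched_at, config)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now - entry[0] > self._ttl:
            # Missing or stale: re-read from the external source.
            config = self._fetch(key)
            self._entries[key] = (now, config)
            return config
        return entry[1]
```

In a Beam pipeline you would typically hold one such cache per worker (for example, created in the DoFn's setup method) so that lookups amortize across elements; staleness is then bounded by the TTL, which fits the minutes-to-hours tolerance described below.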

On Thu, Oct 5, 2017 at 10:16 AM, Yihua Fang <fang.yihua.eric@gmail.com>
wrote:

> I thought about streaming the updates and using a side input, but since
> the config data is not persisted in the data pipeline, how would the
> config be populated in the first place? Is the solution to populate it
> through streaming every time the pipeline is restarted?
>
> 1. The config data can be a few minutes to a few hours late. There is no
> need to reflect config changes immediately.
> 2. The config data shouldn't change often. It is configured by human
> users.
> 3. The config data per key should be about 10-20 key-value pairs.
> 4. Ideally the number of keys is in the range of a few million, but a few
> thousand to begin with.
>
> Thanks
> Eric
>
> On Thu, Oct 5, 2017 at 9:09 AM Lukasz Cwik <lcwik@google.com> wrote:
>
>> Can you stream the updates to the keys into the pipeline and then use
>> them as a side input, performing a join against your main stream that
>> needs the config data?
>> You could also use an in-memory cache that periodically refreshes keys
>> from the external source.
>>
>> A better answer depends on:
>> * how stale can the config data be?
>> * how often does the config data change?
>> * how much config data do you expect to have per key?
>> * how many keys do you expect to have?
>>
>>
>> On Wed, Oct 4, 2017 at 5:41 PM, Yihua Fang <fang.yihua.eric@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> The use case is that I have an external source that stores a
>>> configuration for each key, accessible via RESTful APIs, and the Beam
>>> pipeline should use the config to process each element for each key.
>>> What is the best approach for injecting the latest config into the
>>> pipeline?
>>>
>>> Thanks
>>> Eric
>>>
>>
>>
