beam-user mailing list archives

From Lukasz Cwik
Subject Re: Best way to inject external per key config into the pipeline
Date Thu, 05 Oct 2017 17:58:53 GMT
Yes, every time you start the pipeline you need to read in all the config
data. You can do this by flattening a bounded source, which reads the
current state of the config data, with a streaming source that delivers
updates, and then using the result as a side input. On the other hand, with
a few million keys, an in-memory cache with your own refresh policy that
contacts your datastore would also likely work well.
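
One way the in-memory cache option could look is sketched below. It is only
an illustration under stated assumptions, not code from this thread: the
class name RefreshingConfigCache, the fetch callable (standing in for a call
to the config datastore), and the 600-second TTL are all hypothetical. The
refresh policy here is a simple per-key time-to-live, which fits config that
may be minutes to hours stale.

```python
import time
from typing import Callable, Dict, Tuple


class RefreshingConfigCache:
    """Per-key config cache that reloads an entry once it is older than ttl_seconds."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # hypothetical loader, e.g. a REST call to the datastore
        self._ttl = ttl_seconds  # refresh policy: maximum staleness allowed per key
        self._clock = clock      # injectable so the policy can be exercised without sleeping
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: no call to the datastore
        config = self._fetch(key)  # missing or stale: reload from the datastore
        self._entries[key] = (now, config)
        return config


# Usage with a fake datastore and a fake clock (both assumptions for the demo).
calls = []


def fake_fetch(key: str) -> dict:
    calls.append(key)
    return {"version": len(calls)}


now = [0.0]
cache = RefreshingConfigCache(fake_fetch, ttl_seconds=600, clock=lambda: now[0])
first = cache.get("key-1")   # miss: fetched from the fake datastore
second = cache.get("key-1")  # fresh: served from memory
now[0] = 601.0               # advance the fake clock past the TTL
third = cache.get("key-1")   # stale: fetched again
```

In a pipeline, an instance like this would typically be created once per
worker and consulted for each element's key, so the datastore sees at most
one request per key per TTL window on each worker.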

On Thu, Oct 5, 2017 at 10:16 AM, Yihua Fang wrote:

> I thought about streaming the updates and using a side input, but since the
> config data is not persisted in the data pipeline, how would the config
> be populated in the first place? Is the solution to populate it through
> streaming every time the pipeline is rebooted?
> 1. The config data can be a few minutes to a few hours late. There is no
> need to reflect config changes immediately.
> 2. The config data shouldn't change often. It is configured by human
> users.
> 3. The config data per key should be about 10-20 key-value pairs.
> 4. Ideally the number of keys is in the range of a few million, but a few
> thousand to begin with.
> Thanks
> Eric
> On Thu, Oct 5, 2017 at 9:09 AM Lukasz Cwik wrote:
>> Can you stream the updates to the keys into the pipeline and then use them
>> as a side input, performing a join against your main stream that needs
>> the config data?
>> You could also use an in-memory cache that periodically refreshes keys
>> from the external source.
>> A better answer depends on:
>> * how stale can the config data be?
>> * how often does the config data change?
>> * how much config data do you expect to have per key?
>> * how many keys do you expect to have?
>> On Wed, Oct 4, 2017 at 5:41 PM, Yihua Fang wrote:
>>> Hi,
>>> The use case is that I have an external source that stores a
>>> configuration for each key, accessible via RESTful APIs, and the Beam
>>> pipeline should use that config to process each element for each key.
>>> What is the best approach for injecting the latest config into the
>>> pipeline?
>>> Thanks
>>> Eric
