beam-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lukasz Cwik <>
Subject Re: Providing HTTP client to DoFn
Date Thu, 06 Jul 2017 15:38:41 GMT
#1: For all runners, the side input needs to be ready (data needs to exist
for the given window) before the main input is executed which means that in
your case the whole side input will be materialized before the main input
is executed.

#2: For Dataflow, a map/multimap based side input is loaded lazily in parts
based upon which key is being accessed. Each segment of the map is cached
in memory (using an LRU policy) and the loading the data remotely is the
largest cost in such a system. Depending on how large your main input is,
performing a group by key on your access key will speed up your lookups
(because you'll get a lot more cache hits) but you have to weight the cost
of doing the GBK vs speed up in side input usage.

What do you mean by "expanding the tuples to the expanded data"?
* Are you trying to say that typically you'll look up the same value 100+
times from the side input
** In this case performing a GBK based upon your lookup key may be of
* Are you trying to say that you could have the data stored within the side
input instead of just the index but it would be 100 times larger?
** A map based side input which has values which are 4 bytes vs 400 bytes
isn't going to change much in lookup cost

On Wed, Jul 5, 2017 at 6:22 PM, Randal Moore <> wrote:

> Based on my understanding so far, I'm targeting Dataflow with a batch
> pipeline. Just starting to experiment with the setup/teardown with the
> local runner - that might work fine.
> Somewhat intrigued with the side inputs, though.  The pipeline might
> iterate over 1,000,000 tuples of two integers.  The integers are indices
> into a database of data. A given integer will be repeated in the inputs
> many times.  Am I prematurely optimizing to rule out expanding the tuples
> to the expanded data as each value might be expanded 100 or more times? As
> side inputs, it might expand to ~100GB.  Expanding the input would be
> significantly bigger.
> #1 how does Dataflow schedule the pipeline with a map side input - does it
> wait until the whole map is collected?
> #2 can the DoFn specify that it depends on only specific keys of the side
> input map?  does that affect the scheduling of the DoFn?
> Thanks for any pointers...
> rdm
> On Wed, Jul 5, 2017 at 4:58 PM Lukasz Cwik <> wrote:
>> That should have said:
>> ~100s MiBs per window in streaming pipelines
>> On Wed, Jul 5, 2017 at 2:58 PM, Lukasz Cwik <> wrote:
>>> #1, side inputs supported sizes and performance are specific to a
>>> runner. For example, I know that Dataflow supports side inputs which are 1+
>>> TiB (aggregate) in batch pipelines and ~100s MiBs per window because there
>>> have been several one off benchmarks/runs. What kinds of sizes/use case do
>>> you want to support, some runners will do a much better job with really
>>> small side inputs while others will be better with really large side inputs?
>>> #2, this depends on which library your using to perform the REST calls
>>> and whether it is thread safe. DoFns can be shared across multiple bundles
>>> and can contain methods marked with @Setup/@Teardown which only get invoked
>>> once per DoFn instance (which is relatively infrequently) and you could
>>> store an instance per DoFn instead of a singleton if the REST library was
>>> not thread safe.
>>> On Wed, Jul 5, 2017 at 2:45 PM, Randal Moore <>
>>> wrote:
>>>> I have a step in my beam pipeline that needs some data from a rest
>>>> service. The data acquired from the rest service is dependent on the
>>>> context of the data being processed and relatively large. The rest client
>>>> am using isn't serializable - nor is it likely possible to make it so
>>>> (background threads, etc.).
>>>> #1 What are the practical limits to the size of side inputs (e.g., I
>>>> could try to gather all the data from the rest service and provide it as
>>>> side-input)?
>>>> #2 Assuming that using the rest client is the better option, would a
>>>> singleton instance be safe way to instantiate the rest client?
>>>> Thanks,
>>>> rdm

View raw message