beam-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kenneth Knowles <>
Subject Re: How can runners make use of sink parallelism?
Date Tue, 03 Apr 2018 22:11:58 GMT
The runner should generally not need to be aware of any getNumShard() API
on a connector. The connector itself is probably a composite transform
(with a ParDo or two or three somewhere doing the actual writes) and should
be designed to expose available parallelism. Specifying the number of
shards actually usually limits the parallelism, versus letting the runner
use the maximum allowed parallelism.

If the connector does a GBK to gather input elements into a single
iterable, then it is a single element and cannot be processed in parallel
(except through splittable DoFn, but in that case you may not need to do
the GBK in the first place). And converse to that, if the connector does
not do a GBK to gather input elements, then the runner is permitted to
bundle them any way it wants and process all of them as though in parallel
(except for stateful DoFn, in which case you probably don't need the GBK).

Bundling is an important way that this works, too, since the @FinishBundle
method is really a "flush" method, with @ProcessElement perhaps buffering
up elements to be written to e.g. the same file shard. It is not this
simple in practice but that gives the idea of how even with unrestricted
elementwise parallelism you don't get one shard per element.

These are all just ideas, and I'm not the connector expert. But I think the
TL;DR is that a runner shouldn't need to know this - have you hit specific
issues with a particular connector? That could make this a very productive


On Mon, Apr 2, 2018 at 1:41 PM Shen Li <> wrote:

> Hi,
> It seems that there is no Sink base class. Some IO connectors (e.g.,
> KafkaIO and TextIO) provide a getNumShard() API. But it is not generally
> available for all existing Beam IO connectors and potential custom ones. Although
> some IO connectors are implemented using ParDo/GBK, it is unclear whether
> the runner can directly parallelize those transforms (e.g., what if it only
> writes to a single file). Is there a general way for runners to take
> advantage of sink parallelism?
> Thanks,
> Shen

View raw message