beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <>
Subject [jira] [Commented] (BEAM-68) Support for limiting parallelism of a step
Date Thu, 31 Mar 2016 16:00:27 GMT


Eugene Kirpichov commented on BEAM-68:

I re-duped BEAM-169 against BEAM-92.
- I agree that it won't work with an empty key set K; however, that should be relatively
unlikely if the user has enough data that sharding makes sense and their hash function is
good. In some sharded-sink scenarios it may even be desirable not to write empty shards.
- I'm not sure what you mean by "won't scale": each individual shard has to be written
sequentially, but the shards can be written in parallel with each other in this proposed
implementation, since dynamic rebalancing will separate the shard keys from one another.
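The keying step under discussion can be sketched in plain Java (no Beam dependency): elements are assigned one of k shard keys by hashing, so grouping by that key yields at most k shards, and with little data some shards may be empty, the "empty K" case above. The class and method names here are illustrative, not Beam API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ShardKeys {
    // Assign an element to one of k shard keys via its hash.
    // floorMod keeps the result non-negative for negative hash codes.
    public static int shardKeyFor(String element, int k) {
        return Math.floorMod(element.hashCode(), k);
    }

    // Group elements by shard key. With a good hash and enough data,
    // all k keys are hit; with few elements, some shards stay empty.
    public static Map<Integer, List<String>> shard(List<String> elements, int k) {
        return elements.stream()
                .collect(Collectors.groupingBy(e -> shardKeyFor(e, k)));
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> shards =
                shard(Arrays.asList("a", "b", "c", "d", "e"), 3);
        // Never more than k = 3 distinct shard keys.
        System.out.println(shards.keySet().size() <= 3);
    }
}
```

In the proposed implementation, each resulting group is written sequentially, while dynamic rebalancing lets the groups be written in parallel with one another.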

> Support for limiting parallelism of a step
> ------------------------------------------
>                 Key: BEAM-68
>                 URL:
>             Project: Beam
>          Issue Type: New Feature
>          Components: beam-model
>            Reporter: Daniel Halperin
> Users may want to limit the parallelism of a step. Two classic use cases are:
> - The user wants to produce at most k files, so they set TextIO.Write.withNumShards(k).
> - An external API only supports k QPS, so the user sets a limit of k/(expected QPS/step)
> on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. A GroupByKey
> with exactly k keys will guarantee that only k elements are produced, but runners are free
> to break fusion in ways that allow each element to be processed in parallel later.
> To implement this functionality, I believe we need to add this support to the Beam Model.
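The second use case boils down to simple arithmetic: if the external API allows k QPS in total and each ParDo instance is expected to issue a given QPS, the parallelism cap is their quotient. A minimal sketch, assuming illustrative names (this is not a Beam API):

```java
public class ParallelismLimit {
    // Parallelism cap for an external API allowing apiQpsLimit total
    // queries/sec when each ParDo instance issues perStepQps.
    // Mirrors the k/(expected QPS/step) formula from the issue.
    public static int maxParallelism(double apiQpsLimit, double perStepQps) {
        // Floor so the aggregate rate never exceeds the limit;
        // clamp to at least one worker.
        return Math.max(1, (int) Math.floor(apiQpsLimit / perStepQps));
    }

    public static void main(String[] args) {
        // API allows 100 QPS, each instance issues ~8 QPS -> cap of 12.
        System.out.println(maxParallelism(100, 8));
    }
}
```

The point of the issue is that even after computing this number, the model today offers no reliable way to hold a step to it, since runners may break fusion and re-parallelize.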
