beam-user mailing list archives

From Lukasz Cwik <>
Subject Re: cores and partitions in DataFlow
Date Fri, 14 Sep 2018 15:53:47 GMT
Dataflow has logical partitions of work and relies on auto-scaling and
dynamic work rebalancing to distribute and redistribute work. Typically,
machine size vs. number of machines shouldn't matter unless you run really
small or very large jobs, since there is no point in running a very
short-lived job on a machine that has 32 cores. Depending on your job,
though, things like the amount of RAM per CPU can matter if your job
processes very large elements (for example, genome sequences) or buffers a
lot in memory.
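For reference, worker machine type and autoscaling limits can be chosen
through pipeline options when launching on Dataflow. Here is a minimal
sketch using the Python SDK; the project, region, and bucket values are
placeholders for illustration, not values from this thread:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/region/bucket values; substitute your own.
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/tmp',
        # Favor more RAM per vCPU (e.g. a highmem type) when elements
        # are large or the job buffers heavily in memory.
        machine_type='n1-highmem-4',
        # Upper bound for Dataflow's autoscaling.
        max_num_workers=10,
    )

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | beam.Create(['a', 'b', 'c'])
         | beam.Map(str.upper))

The design choice here follows the advice above: when per-element memory
is the constraint, pick a machine type with more RAM per CPU rather than
simply adding cores, and let autoscaling decide the worker count.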

On Thu, Sep 13, 2018 at 6:34 PM <> wrote:

> Spark has 2 levels of processing:
> a) across different workers;
> b) within the same executor, multiple cores can work on different partitions.
> I know that in Apache Beam with Dataflow as the runner, partitioning is
> abstracted, but does Dataflow use multiple cores to process different
> partitions at the same time?
> The objective is to understand what machines should be used to run
> pipelines. Should one give thought to the number of cores on a machine,
> or does it not matter?
> Thanks
> Aniruddh
