beam-commits mailing list archives

From "Anant Bhandarkar (JIRA)" <>
Subject [jira] [Commented] (BEAM-2208) Python SDK wordcount on cloud Dataflow runner is slow
Date Tue, 09 May 2017 06:22:04 GMT


Anant Bhandarkar commented on BEAM-2208:

[~altay] This word count job was run yesterday.

We tried setting the number of worker instances to 50 instead of using autoscaling, but the job
used at most 2 workers and took 34 min 54 sec to execute.

I am wondering what would ensure that the work is distributed among the workers, and what would
cause such a difference in execution times compared to Java in a word count scenario.
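
For reference, a minimal sketch of how a fixed-size worker pool is usually requested for a Python pipeline on Dataflow. It only assembles the command-line flags and does not import Beam. The flag names (`--num_workers`, `--autoscaling_algorithm`) are the Beam Python SDK worker options as I understand them from recent releases, and they may differ in 0.6.0. The helper name `fixed_pool_flags` is hypothetical. Requesting 50 workers this way does not by itself explain why only 2 were used.

```python
# Sketch only: builds the flag list for a fixed-size Dataflow worker pool.
# Assumption: flag names match the Beam Python SDK worker options
# (num_workers, autoscaling_algorithm); verify against the SDK version in use.

def fixed_pool_flags(num_workers):
    """Return flags asking Dataflow for a fixed pool of `num_workers` workers."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [
        "--runner=DataflowRunner",
        "--num_workers=%d" % num_workers,
        # NONE is intended to disable autoscaling so the pool size stays fixed.
        "--autoscaling_algorithm=NONE",
    ]


flags = fixed_pool_flags(50)
print(" ".join(flags))
```

These flags would be appended to the usual wordcount arguments (project, input, output, and so on), which are omitted here.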

> Python SDK wordcount on cloud Dataflow runner is slow
> -----------------------------------------------------
>                 Key: BEAM-2208
>                 URL:
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-dataflow, sdk-py
>    Affects Versions: 0.6.0
>            Reporter: Anant Bhandarkar
>            Assignee: Ahmet Altay
>            Priority: Critical
> I have been trying to run the Beam Word count example with a 2GB file.
> When I run the Java example for word count on this CSV file, the job completes in 7.15secs Mins.
> Job ID: 2017-04-18_23_57_02-2832613177376293063
> But the word count example with the same file using the Python SDK takes 28 to 35 mins.
> Job ID: 2017-04-20_04_48_27-8924552896141769408
> SDK version: Apache Beam SDK for Python 0.6.0

This message was sent by Atlassian JIRA
