beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ahmet Altay (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-1442) Performance improvement of the Python DirectRunner
Date Mon, 27 Feb 2017 18:08:45 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886245#comment-15886245
] 

Ahmet Altay commented on BEAM-1442:
-----------------------------------

It is correct that python DirectRunner will utilize only one core. It actually has a multi-threaded
implementation and the executor will run the pipeline using a thread pool. The problem is,
due to GIL the threads will not run in parallel. In order to get real parallelism the DirectRunner
need to be converted using multiple processes. 

multiprocessing module is well suited for this. However, another important thing here is that
the data will be serialized across processes. This has two implications, 1. DirectRunner needs
to be updated in a way that work items are picklable. 2. We need to pay attention to serializations
costs. Perhaps using a mix of per-procees Queue's and a master work Queue. 

Another important point is that, some work needs to happen in a single process (e.g. GroupByKey),
to cover such cases there needs to be a concept of process affinity.

> Performance improvement of the Python DirectRunner
> --------------------------------------------------
>
>                 Key: BEAM-1442
>                 URL: https://issues.apache.org/jira/browse/BEAM-1442
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py
>            Reporter: Pablo Estrada
>            Assignee: Ahmet Altay
>              Labels: gsoc2017, mentor, python
>
> The DirectRunner for Python and Java are intended to act as policy enforcers, and correctness
checkers for Beam pipelines; but there are users that run data processing tasks in them.
> Currently, the Python Direct Runner has less-than-great performance, although some work
has gone into improving it. There are more opportunities for improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline, and study
the `Pipeline.run` and `DirectRunner.run` methods. Ask questions directly on JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message