beam-user mailing list archives

From Dmitry Demeshchuk <>
Subject Practices for running Python projects on Dataflow
Date Mon, 05 Jun 2017 20:56:06 GMT
Hi list,

Suppose you have a private Python package that contains some code people
want to share when writing their pipelines.

So, typically, the installation process of the package would be either

pip install git+ssh://

or

git clone git://
python mypackage/setup.py install

Now, the problem starts when we want to get that package into Dataflow.
Right now, to my understanding, DataflowRunner supports 3 approaches:


   1. Specifying a requirements_file parameter in the pipeline options. This
      basically must be a requirements.txt file.

   2. Specifying an extra_packages parameter in the pipeline options. This
      must be a list of tarballs, each of which contains a Python package
      packaged using distutils.

   3. Specifying a setup_file parameter in the pipeline options. This will
      just run the python path/to/my/setup.py packaging command and then
      send the resulting files over the wire (a sketch of passing these
      options follows below).
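
For concreteness, here’s a minimal sketch of how these options can be
passed when constructing a pipeline. The option names come from the SDK;
the project, bucket, and file paths are placeholders, not values from this
thread:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A sketch of the three staging mechanisms; enable whichever applies.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',            # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
    requirements_file='requirements.txt',
    # extra_packages=['dist/mypackage-0.0.1.tar.gz'],  # sdist tarball
    # setup_file='./setup.py',
)

p = beam.Pipeline(options=options)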

The best approach I could come up with was including an *additional*
setup.py into the package itself, so that when we install that package, the
file gets installed along with it. And then, I’d point the setup_file
option to that file.
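
As a rough illustration of the idea (a sketch, not the actual gist code;
the mypackage name and version are placeholders), the extra setup.py
bundled inside the package might look like:

# mypackage/setup.py -- installed as part of the package itself, so that
# it can later be handed to the setup_file option (sketch only).
import setuptools

setuptools.setup(
    name='mypackage',                     # placeholder
    version='0.0.1',                      # placeholder
    packages=setuptools.find_packages(),  # assumes the package source
                                          # sits next to this file
)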

This gist
<> shows
the basic approach in code. Both the extra setup.py and the code it
packages are supposed to be present in the installed package.
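
Then, when launching a pipeline, the setup_file option can point at the
installed copy, along these lines (again a sketch; mypackage is a
placeholder):

import os
import mypackage  # the installed private package (placeholder name)
from apache_beam.options.pipeline_options import PipelineOptions

# Locate the setup.py that was installed alongside the package code.
setup_path = os.path.join(os.path.dirname(mypackage.__file__), 'setup.py')
options = PipelineOptions(setup_file=setup_path)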

It kind of works for me, with some caveats, but I just wanted to find out
if there’s a more decent way to handle my situation. I’m not keen on
specifying that private package as a git dependency, because of having to
worry about git credentials, but maybe there are other ways?

Best regards,
Dmitry Demeshchuk.
