beam-user mailing list archives

From Dmitry Demeshchuk <dmi...@postmates.com>
Subject Practices for running Python projects on Dataflow
Date Mon, 05 Jun 2017 20:56:06 GMT
Hi list,

Suppose you have a private Python package that contains some code people
want to share when writing their pipelines.

So, typically, the installation process of the package would be either

pip install git+ssh://git@github.com/mycompany/mypackage#egg=mypackage

or

git clone git://git@github.com/mycompany/mypackage
cd mypackage && python setup.py install

Now, the problem starts when we want to get that package into Dataflow.
Right now, to my understanding, DataflowRunner supports 3 approaches (a rough
sketch of passing these options in code follows the list):

   1. Specifying a requirements_file parameter in the pipeline options. This
      basically must be a requirements.txt file.

   2. Specifying an extra_packages parameter in the pipeline options. This
      must be a list of tarballs, each of which contains a Python package
      built with distutils.

   3. Specifying a setup_file parameter in the pipeline options. This will
      just run python path/to/my/setup.py sdist and then send the resulting
      tarball over the wire.
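
For reference, passing these options in code looks roughly like this (the
project, bucket, and file paths are made up, and normally only one of the
three staging options would be set at a time):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Dependency staging: pick whichever of the three flags applies.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    temp_location='gs://my-bucket/temp',
    requirements_file='requirements.txt',            # approach 1
    # extra_packages=['dist/mypackage-0.1.tar.gz'],  # approach 2
    # setup_file='./setup.py',                       # approach 3
)

p = beam.Pipeline(options=options)
p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)
p.run()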

The best approach I could come up with was including an *additional*
setup.py in the package itself, so that when we install that package, the
setup.py file gets installed along with it. Then I point the setup_file
option at that file.

This gist
<https://gist.github.com/doubleyou/be01226352372491babda7602022c506> shows
the basic approach in code. Both setup.py and options.py are supposed to be
present in the installed package.
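
Roughly, it boils down to two files living inside the installed package
(this is a simplified sketch of the idea rather than the exact gist
contents; the names and the helper function are illustrative):

# mypackage/setup.py -- gets installed together with the package
import setuptools

setuptools.setup(
    name='mypackage',
    version='0.0.1',
    packages=setuptools.find_packages(),
)

# mypackage/options.py -- builds pipeline options pointing at that file
import os
from apache_beam.options.pipeline_options import PipelineOptions

def dataflow_options(**kwargs):
    # Locate the setup.py that ships inside the installed package and hand
    # it to Dataflow via the setup_file option.
    setup_path = os.path.join(os.path.dirname(__file__), 'setup.py')
    return PipelineOptions(setup_file=setup_path, **kwargs)

Pipeline code then only has to call dataflow_options(project=...,
temp_location=...) instead of hand-crafting the setup_file path.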

It kind of works for me, with some caveats, but I just wanted to find out
whether there's a more decent way to handle my situation. I'm not keen on
specifying that private package as a git dependency, because then I have to
worry about git credentials, but maybe there are other ways?

Thanks!
-- 
Best regards,
Dmitry Demeshchuk.
