beam-commits mailing list archives

From al...@apache.org
Subject [1/3] beam-site git commit: Add new Managing Python Pipeline Dependencies page
Date Fri, 10 Feb 2017 20:05:49 GMT
Repository: beam-site
Updated Branches:
  refs/heads/asf-site 703e0bb2b -> a1e2a39f6


Add new Managing Python Pipeline Dependencies page


Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/2dd2c59c
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/2dd2c59c
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/2dd2c59c

Branch: refs/heads/asf-site
Commit: 2dd2c59cbc4b9d45fd542c4732c26ac622e463b6
Parents: 703e0bb
Author: melissa <melissapa@google.com>
Authored: Fri Feb 3 18:44:19 2017 -0800
Committer: Ahmet Altay <altay@google.com>
Committed: Fri Feb 10 12:04:23 2017 -0800

----------------------------------------------------------------------
 .../sdks/python-pipeline-dependencies.md        | 106 +++++++++++++++++++
 src/documentation/sdks/python.md                |   3 +
 2 files changed, 109 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam-site/blob/2dd2c59c/src/documentation/sdks/python-pipeline-dependencies.md
----------------------------------------------------------------------
diff --git a/src/documentation/sdks/python-pipeline-dependencies.md b/src/documentation/sdks/python-pipeline-dependencies.md
new file mode 100644
index 0000000..916a9b5
--- /dev/null
+++ b/src/documentation/sdks/python-pipeline-dependencies.md
@@ -0,0 +1,106 @@
+---
+layout: default
+title: "Managing Python Pipeline Dependencies"
+permalink: /documentation/sdks/python-pipeline-dependencies/
+---
+# Managing Python Pipeline Dependencies
+
+> **Note:** This page is only applicable to runners that do remote execution.
+
+When you run your pipeline locally, the packages that your pipeline depends on are available
because they are installed on your local machine. However, when you want to run your pipeline
remotely, you must make sure these dependencies are available on the remote machines. This
tutorial shows you how to make your dependencies available to the remote workers. Each section
below refers to a different source that your package may have been installed from.
+
+**Note:** Remote workers used for pipeline execution typically have a standard Python 2.7
distribution installed. If your code relies only on standard Python packages, then you
probably don't need to follow any of the steps on this page.
+
+
+## <a name="pypi"></a>PyPI Dependencies
+
+If your pipeline uses public packages from the [Python Package Index](https://pypi.python.org/pypi),
make these packages available remotely by performing the following steps:
+
+**Note:** If your PyPI package depends on a non-Python package (e.g. a package that requires
installation on Linux using the `apt-get install` command), see the [PyPI Dependencies with
Non-Python Dependencies](#nonpython) section instead.
+
+1. Find out which packages are installed on your machine. Run the following command:
+
+        pip freeze > requirements.txt
+
+    This command creates a `requirements.txt` file that lists all packages that are installed
on your machine, regardless of where they were installed from.
+
+2. Edit the `requirements.txt` file and keep only the packages that were installed from
PyPI and are used in the workflow source. Delete every package that is not relevant to your
code.
+
+3. Run your pipeline with the following command-line option:
+
+        --requirements_file requirements.txt
+
+    The runner will use the `requirements.txt` file to install your additional dependencies
onto the remote workers.
+
+**Important:** Remote workers will install all packages listed in the `requirements.txt`
file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt`
file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will
fail when attempting to install packages from sources that are unknown to them.
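+
As a starting point for step 2, the obviously non-PyPI entries can be dropped with a small script. The following is a minimal sketch, not part of Beam: the helper name `keep_pypi_pins` and the sample entries are hypothetical. It is only a heuristic, because a locally installed package can still appear as a plain `name==version` line, so the result still needs a manual review.

```python
def keep_pypi_pins(freeze_lines):
    """Return only plain ``name==version`` pins from ``pip freeze`` output."""
    kept = []
    for raw in freeze_lines:
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Editable installs ("-e git+...") and direct references
        # ("name @ file:///...") are not plain PyPI pins.
        if line.startswith("-e ") or " @ " in line:
            continue
        if "==" in line:
            kept.append(line)
    return kept


# Hypothetical sample of ``pip freeze`` output.
sample = [
    "numpy==1.12.0",
    "-e git+https://github.com/example/private-lib.git@abc123#egg=private_lib",
    "local-tool @ file:///tmp/local-tool",
    "six==1.10.0",
]
print("\n".join(keep_pypi_pins(sample)))
```

The printed lines can be written to `requirements.txt` and then trimmed further to the packages that the workflow source actually uses.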
+
+
+## <a name="localnonpypi"></a>Local or non-PyPI Dependencies
+
+If your pipeline uses packages that are not publicly available on PyPI (e.g. packages that
you've downloaded from a GitHub repo), make these packages available remotely by performing
the following steps:
+
+1. Identify which packages are installed on your machine and are not public. Run the following
command:
+
+        pip freeze
+
+    This command lists all packages that are installed on your machine, regardless of where
they were installed from.
+
+2. Run your pipeline with the following command-line option:
+
+        --extra_package /path/to/package/package-name
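+
When a pipeline needs both PyPI packages and local packages, the options can be assembled in one place. The following is a minimal sketch under two assumptions that this page does not state: that the runner accepts `--requirements_file` and `--extra_package` together, and that `--extra_package` may be repeated once per package. The helper name `dependency_args` is hypothetical.

```python
def dependency_args(requirements_file=None, extra_packages=()):
    """Build the dependency-related command-line options as a list of strings."""
    args = []
    if requirements_file:
        args += ["--requirements_file", requirements_file]
    # Assumption: one --extra_package flag is emitted per package path.
    for path in extra_packages:
        args += ["--extra_package", path]
    return args


print(dependency_args("requirements.txt", ["/path/to/package/package-name"]))
```

The resulting list can be appended to the other arguments that you pass when launching the pipeline.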
+
+
+## <a name="multfiles"></a>Multiple File Dependencies
+
+Often, your pipeline code spans multiple files. To run your project remotely, you must group
these files as a Python package and specify the package when you run your pipeline. When the
remote workers start, they will install your package. To group your files as a Python package
and make it available remotely, perform the following steps:
+
+1. Create a [setup.py](https://pythonhosted.org/an_example_pypi_project/setuptools.html)
file for your project. The following is a very basic `setup.py` file.
+
+        import setuptools
+
+        setuptools.setup(
+            name='PACKAGE-NAME',
+            version='PACKAGE-VERSION',
+            install_requires=[],
+            packages=setuptools.find_packages(),
+        )
+
+2. Structure your project so that the root directory contains the `setup.py` file, the main
workflow file, and a directory with the rest of the files.
+
+        root_dir/
+          setup.py
+          main.py
+          other_files_dir/
+
+    See [Juliaset](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset)
for an example that follows this required project structure.
+
+3. Run your pipeline with the following command-line option:
+
+        --setup_file /path/to/setup.py
+
+**Note:** If you [created a requirements.txt file](#pypi) and your project spans multiple
files, you can remove the `requirements.txt` file and instead add every package listed
in it to the `install_requires` field of the `setuptools.setup()` call (in step 1).
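+
Moving the entries by hand is easy to get wrong for longer files. The following is a minimal sketch of the conversion, assuming the file has already been reduced to plain PyPI pins as described in the PyPI section; the helper name `requirements_to_install_requires` and the sample contents are hypothetical.

```python
def requirements_to_install_requires(text):
    """Turn the contents of a requirements file into an install_requires list."""
    requirements = []
    for raw in text.splitlines():
        line = raw.strip()
        # Ignore blank lines and whole-line comments.
        if not line or line.startswith("#"):
            continue
        requirements.append(line)
    return requirements


# Hypothetical contents of an already-trimmed requirements.txt.
sample_text = "numpy==1.12.0\n\n# testing helpers\nsix==1.10.0\n"
print(requirements_to_install_requires(sample_text))
```

The printed list can be pasted as the value of `install_requires` in `setup.py`, after which the `requirements.txt` file is no longer needed.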
+
+
+## <a name="nonpython"></a>Non-Python Dependencies or PyPI Dependencies with
Non-Python Dependencies
+
+If your pipeline uses non-Python packages (e.g. packages that require installation using
the `apt-get install` command), or uses a PyPI package that depends on non-Python dependencies
during package installation, you must perform the following steps.
+
+1. Add the required installation commands (e.g. the `apt-get install` commands) for the non-Python
dependencies to the list of `CUSTOM_COMMANDS` in your `setup.py` file. See the [Juliaset setup.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py)
for an example.
+
+    **Note:** You must make sure that these commands are runnable on the remote worker (e.g.
if you use `apt-get`, the remote worker needs `apt-get` support).
+
+2. If you are using a PyPI package that depends on non-Python dependencies, add `['pip',
'install', '<your PyPI package>']` to the list of `CUSTOM_COMMANDS` in your `setup.py`
file.
+
+3. Structure your project so that the root directory contains the `setup.py` file, the main
workflow file, and a directory with the rest of the files.
+
+        root_dir/
+          setup.py
+          main.py
+          other_files_dir/
+
+    See the [Juliaset](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset)
project for an example that follows this required project structure.
+
+4. Run your pipeline with the following command-line option:
+
+        --setup_file /path/to/setup.py
+
+**Note:** Because custom commands execute after the dependencies for your workflow are installed
(by `pip`), you should omit that PyPI package from the pipeline's `requirements.txt`
file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py`
file. Otherwise, `pip` tries to install the package before its non-Python dependencies are in place.
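+
To make steps 1 and 2 concrete, the following is a minimal sketch of a `CUSTOM_COMMANDS` list and a function that runs each entry in order. The package names `libexample-dev` and `example-package` and the function name `run_custom_commands` are hypothetical, and the sketch does not show how the list is wired into `setuptools.setup()`; see the linked Juliaset `setup.py` for that part.

```python
import subprocess
import sys

# Hypothetical commands: system packages first, then the PyPI package
# that depends on them (step 2).
CUSTOM_COMMANDS = [
    ["apt-get", "update"],
    ["apt-get", "--assume-yes", "install", "libexample-dev"],
    ["pip", "install", "example-package"],
]


def run_custom_commands(commands):
    """Run each command in order and return how many were executed.

    subprocess.check_call raises CalledProcessError on a non-zero exit
    status, so a failing command stops the sequence.
    """
    count = 0
    for command in commands:
        print("Running: " + " ".join(command))
        subprocess.check_call(command)
        count += 1
    return count


# Demonstration with a harmless command instead of the real list.
run_custom_commands([[sys.executable, "-c", "print('custom command ran')"]])
```

The order of the list matters: the `pip install` entry comes last so that the non-Python dependencies are already present when the package is built.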
+

http://git-wip-us.apache.org/repos/asf/beam-site/blob/2dd2c59c/src/documentation/sdks/python.md
----------------------------------------------------------------------
diff --git a/src/documentation/sdks/python.md b/src/documentation/sdks/python.md
index eee4801..6a92199 100644
--- a/src/documentation/sdks/python.md
+++ b/src/documentation/sdks/python.md
@@ -17,3 +17,6 @@ Then, follow the [Beam Python SDK Quickstart]({{ site.baseurl }}/get-started/qui
 
 Python is a dynamically-typed language with no static type checking. The Beam SDK for Python
uses type hints during pipeline construction and runtime to try to emulate the correctness
guarantees achieved by true static typing. [Ensuring Python Type Safety]({{ site.baseurl }}/documentation/sdks/python-type-safety)
walks through how to use type hints, which help you to catch potential bugs up front with
the [Direct Runner]({{ site.baseurl }}/documentation/runners/direct/).
 
+## Managing Python Pipeline Dependencies
+
+When you run your pipeline locally, the packages that your pipeline depends on are available
because they are installed on your local machine. However, when you want to run your pipeline
remotely, you must make sure these dependencies are available on the remote machines. [Managing
Python Pipeline Dependencies]({{ site.baseurl }}/documentation/sdks/python-pipeline-dependencies)
shows you how to make your dependencies available to the remote workers.

