beam-commits mailing list archives

From "Scott Wegner (JIRA)" <>
Subject [jira] [Commented] (BEAM-680) Python Dataflow stages stale requirements-cache dependencies
Date Mon, 26 Sep 2016 20:32:20 GMT


Scott Wegner commented on BEAM-680:

/cc [~robertwb]

This came up as an issue with dependency_test.test_with_requirements_file() in PR 1005.

We use pip to download all required dependencies, but we generate the full list of files to
stage by scanning the cache directory. Perhaps there is a way to ask pip for the transitive
dependency list directly instead.
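
For reference, a minimal sketch of the pattern described above; the function names and
structure here are illustrative, not the SDK's actual internals, and the cache path is the
one from the repro below. The last function shows one possible fix: downloading into a fresh
per-run directory, so the listing is exactly the transitive closure pip just resolved.

{code:python}
import os
import subprocess
import sys
import tempfile

CACHE_DIR = '/tmp/dataflow-requirements-cache'  # default cache from the repro below


def populate_requirements_cache(requirements_file, cache_dir=CACHE_DIR):
    # 'pip download' resolves requirements_file and fetches its full
    # transitive closure into cache_dir.
    subprocess.check_call([
        sys.executable, '-m', 'pip', 'download',
        '--dest', cache_dir, '-r', requirements_file,
    ])


def stage_cached_packages(cache_dir=CACHE_DIR):
    # The problematic step: stage *everything* found in the cache directory.
    # The directory persists across runs, so this also picks up packages
    # (or stray files) left over from earlier pipelines.
    return [os.path.join(cache_dir, name) for name in os.listdir(cache_dir)]


def stage_exact_dependencies(requirements_file):
    # One possible fix: download into a fresh per-run directory, so only
    # the dependencies of *this* pipeline are staged.
    fresh_dir = tempfile.mkdtemp(prefix='dataflow-requirements-')
    populate_requirements_cache(requirements_file, fresh_dir)
    return stage_cached_packages(fresh_dir)
{code}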

> Python Dataflow stages stale requirements-cache dependencies
> ------------------------------------------------------------
>                 Key: BEAM-680
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>            Reporter: Scott Wegner
>            Priority: Minor
> When executing a Python pipeline using a requirements.txt file, the Dataflow runner will
> stage every dependency found in its requirements cache directory: those specified in the
> requirements.txt, plus any previously cached dependencies. This results in a bloated staging
> directory if previous pipeline runs from the same machine included different dependencies.
> Repro:
> # Initialize a virtualenv and pip install apache_beam
> # Create an empty requirements.txt file
> # Create a simple pipeline using DataflowPipelineRunner and a requirements.txt file
> (a minimal sketch appears below, after the issue description)
> # {{touch /tmp/dataflow-requirements-cache/extra-file.txt}}
> # Run the pipeline with a specified staging directory
> # Check the staged files for the job
> 'extra-file.txt' will be uploaded with the job, along with any other cached dependencies
> under /tmp/dataflow-requirements-cache.
> We should only be staging the dependencies necessary for a pipeline, not all previously-cached
> dependencies found on the machine.
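
For step 3 of the repro (the original example link is missing from the archive), a minimal
pipeline might look like the sketch below. The project and bucket names are placeholders,
and the options module path and runner name have moved between SDK versions; this uses the
runner spelling from the repro.

{code:python}
# Minimal repro pipeline for step 3; project/bucket names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowPipelineRunner',
    '--project=my-project',                       # placeholder
    '--staging_location=gs://my-bucket/staging',  # placeholder
    '--temp_location=gs://my-bucket/temp',        # placeholder
    '--requirements_file=requirements.txt',       # the empty file from step 2
])

p = beam.Pipeline(options=options)
_ = p | beam.Create(['a', 'b']) | beam.Map(lambda x: x.upper())
p.run()
{code}

After running this, the staged files for the job should include extra-file.txt, demonstrating
that the runner staged the entire cache directory rather than the pipeline's dependencies.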

This message was sent by Atlassian JIRA
