spark-reviews mailing list archives

From vanzin <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-5479] [yarn] Handle --py-files correctl...
Date Fri, 22 May 2015 18:44:11 GMT
GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/6360

    [SPARK-5479] [yarn] Handle --py-files correctly in YARN.

    The bug description is a little misleading: the actual issue is that
    .py files are not handled correctly when distributed by YARN. They're
    added to "spark.submit.pyFiles", but when context.py processes that
    list it only accepts a whitelist of extensions (see PACKAGE_EXTENSIONS),
    and that whitelist does not include .py files.
    
    On top of that, archives were not handled at all! They made it to the
    driver's python path, but never made it to executors, since the mechanism
    used to propagate their location (spark.submit.pyFiles) only works on
    the driver side.
    
    So, instead, ignore "spark.submit.pyFiles" and just build PYTHONPATH
    correctly for both driver and executors. Individual .py files are
    placed in a subdirectory of the container's local dir in the cluster,
    which is then added to the python path. Archives are added directly.
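
    Conceptually, the path each container ends up with looks like the sketch
    below (the real logic lives in the Scala YARN client; the subdirectory
    name used here is just illustrative):

        import os

        # Illustrative name for the subdirectory that holds the localized
        # plain .py files inside the container's working directory.
        LOCALIZED_PYTHON_DIR = "__pyfiles__"

        def build_pythonpath(container_dir, archives):
            """Plain .py files share one subdirectory; archives are added as-is."""
            entries = [os.path.join(container_dir, LOCALIZED_PYTHON_DIR)]
            entries += [os.path.join(container_dir, a) for a in archives]
            return os.pathsep.join(entries)

        # build_pythonpath("/yarn/local/container_01", ["deps.zip"])
        #   -> "/yarn/local/container_01/__pyfiles__:/yarn/local/container_01/deps.zip"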
    
    The change, as a side effect, ends up solving the symptom described
    in the bug. The issue was not that the files were not being distributed,
    but that they were never made visible to the python application
    running under Spark.
    
    Also included is a proper unit test for running python on YARN, which
    broke in several different ways with the previous code.
    
    A short walkthrough of the changes:
    - SparkSubmit does not try to be smart about how YARN handles python
      files anymore. It just passes down the configs to the YARN client
      code.
    - The YARN client distributes python files and archives differently,
      placing the files in a subdirectory.
    - The YARN client now sets PYTHONPATH for the processes it launches;
      to properly handle different locations, it uses YARN's support for
      embedding env variables. To avoid YARN expanding those variables at
      the wrong time, the SparkConf is now propagated to the AM using a
      conf file instead of command line options.
    - Because the Client initialization code is a maze of implicit
      dependencies, some code needed to be moved around to make sure
      all needed state was available when the code ran.
    - The pyspark tests in YarnClusterSuite now actually distribute and try
      to use both a python file and an archive containing a different python
      module (a sketch of such a test script follows this list). Also added
      a yarn-client test for completeness.
    - I cleaned up some of the code around distributing files to YARN, to
      avoid adding more copied & pasted code to handle the new files being
      distributed.
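
    As a rough illustration of what the new tests exercise, the driver script
    they submit looks something like the sketch below: it imports one module
    shipped as a plain .py file and another shipped inside an archive, then
    uses both on the executors (module and file names are made up):

        # test_driver.py -- submitted with --py-files mod1.py,mod2.zip
        from pyspark import SparkContext

        import mod1            # comes from the plain .py file
        import mod2            # comes from inside the archive

        if __name__ == "__main__":
            sc = SparkContext()
            try:
                # Running the lambda on executors proves both modules are on
                # the executors' PYTHONPATH, not just the driver's.
                result = sc.parallelize(range(10)).map(
                    lambda x: mod1.func(x) + mod2.func(x)).collect()
                print(result)
            finally:
                sc.stop()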

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-5479

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6360.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6360
    
----
commit 943cbf450d32f49a16091247f2bf7e0679d184ae
Author: Marcelo Vanzin <vanzin@cloudera.com>
Date:   2015-05-21T00:29:29Z

    [SPARK-5479] [yarn] Handle --py-files correctly in YARN.
    

----

