spark-reviews mailing list archives

From vogxn <...@git.apache.org>
Subject [GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Date Mon, 07 Nov 2016 11:25:13 GMT
Github user vogxn commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r86756833
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -69,6 +84,66 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
       }
     
       /**
    +   * Create virtualenv using native virtualenv or conda
    +   *
    +   * Native Virtualenv:
    +   *   -  Execute command: virtualenv -p pythonExec --no-site-packages virtualenvName
    +   *   -  Execute command: python -m pip --cache-dir cache-dir install -r requirement_file
    +   *
    +   * Conda:
    +   *   -  Execute command: conda create --prefix prefix --file requirement_file -y
    +   *
    +   */
    +  def setupVirtualEnv(): Unit = {
    +    logDebug("Start to setup virtualenv...")
    +    logDebug("user.dir=" + System.getProperty("user.dir"))
    +    logDebug("user.home=" + System.getProperty("user.home"))
    +
    +    require(virtualEnvType == "native" || virtualEnvType == "conda",
    +      s"VirtualEnvType: ${virtualEnvType} is not supported" )
    +    virtualEnvName = "virtualenv_" + conf.getAppId + "_" + VIRTUALENV_ID.getAndIncrement()
    +    // use the absolute path in local mode; otherwise just use the filename, as the file
    +    // would be fetched from the FileServer
    +    val pyspark_requirements =
    +      if (Utils.isLocalMaster(conf)) {
    +        conf.get("spark.pyspark.virtualenv.requirements")
    +      } else {
    +        conf.get("spark.pyspark.virtualenv.requirements").split("/").last
    +      }
    +
    +    val createEnvCommand =
    +      if (virtualEnvType == "native") {
    +        Arrays.asList(virtualEnvPath,
    +          "-p", pythonExec,
    +          "--no-site-packages", virtualEnvName)
    +      } else {
    +        Arrays.asList(virtualEnvPath,
    +          "create", "--prefix", System.getProperty("user.dir") + "/" + virtualEnvName,
    --- End diff --
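    
    For reference, here is a rough Python sketch of the two command shapes the doc comment above describes (my own paraphrase of the createEnvCommand branches, not the actual implementation; the helper name and sample arguments are placeholders):
    
    ```
    def create_env_command(env_type, bin_path, python_exec, env_name, requirements):
        # Mirrors the native/conda branches of createEnvCommand in the diff above.
        if env_type == "native":
            return [bin_path, "-p", python_exec, "--no-site-packages", env_name]
        elif env_type == "conda":
            return [bin_path, "create", "--prefix", env_name,
                    "--file", requirements, "-y"]
        raise ValueError("unsupported virtualenv type: %s" % env_type)

    # e.g. for the conda case (the factory would run this via ProcessBuilder):
    print create_env_command("conda", "/usr/lib/anaconda2/bin/conda",
                             "python", "my-env", "conda.txt")
    ```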
    
    I started writing this comment and then had to recompile my cluster; it turned out I had made a mistake in the permissions. Apologies for the false alarm. The patch works fine, and I'm able to run executors with the conda environment. I'll do some more testing on my end.
    
    ===== The following was my setup =====
    Apache Spark (with this patch) is compiled against Apache Hadoop 2.6.0. I've installed `anaconda2-4.1.1` under `/usr/lib/anaconda2` on all nodes in the cluster. I can create conda environments fine using the command `conda create --prefix test-env numpy -y`.
    
    The following shell script is used to submit my pyspark programs:
    
    ```
    $ cat run.sh
    /usr/lib/spark/bin/spark-submit  --master yarn --deploy-mode client \
        --conf "spark.pyspark.virtualenv.enabled=true" \
        --conf "spark.pyspark.virtualenv.type=conda" \
        --conf "spark.pyspark.virtualenv.requirements=/home/tsp/conda.txt" \
        --conf "spark.pyspark.virtualenv.bin.path=/usr/lib/anaconda2/bin/conda" "$@"
    ```
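    
    The same keys can also be set programmatically before the SparkContext is created, since (per the diff above) the factory reads them from the application conf. A minimal sketch, assuming the keys behave the same when set via SparkConf:
    
    ```
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.pyspark.virtualenv.enabled", "true")
            .set("spark.pyspark.virtualenv.type", "conda")
            .set("spark.pyspark.virtualenv.requirements", "/home/tsp/conda.txt")
            .set("spark.pyspark.virtualenv.bin.path", "/usr/lib/anaconda2/bin/conda"))
    sc = SparkContext(conf=conf)
    ```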
    
    This is the program I submitted to check whether the anaconda environment is detected in the executors:
    
    ```
    $ cat execinfo.py
    from pyspark import SparkContext
    import sys

    if __name__ == '__main__':
        sc = SparkContext()
        print sys.version  # Python version on the driver
        print sc.parallelize(range(1, 2)).map(lambda x: sys.version).collect()  # and on the executors
    ```
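    
    As an aside, printing `sys.executable` as well makes it even clearer whether the executors picked up the conda environment's interpreter rather than the system Python (a small variation on the script above):
    
    ```
    from pyspark import SparkContext
    import sys

    if __name__ == '__main__':
        sc = SparkContext()
        print sys.executable  # interpreter path on the driver
        print sc.parallelize(range(1, 2)).map(lambda x: sys.executable).collect()
    ```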
    
    This is what is seen in the debug logs:
    ```
    Caused by: java.lang.RuntimeException: Fail to run command: /usr/lib/anaconda2/bin/conda create --prefix /media/ebs2/yarn/local/usercache/tsp/appcache/application_1478497303110_0005/container_1478497303110_0005_01_000003/virtualenv_application_1478497303110_0005_3 --file conda.txt -y
            at org.apache.spark.api.python.PythonWorkerFactory.execCommand(PythonWorkerFactory.scala:142)
            at org.apache.spark.api.python.PythonWorkerFactory.setupVirtualEnv(PythonWorkerFactory.scala:124)
            at org.apache.spark.api.python.PythonWorkerFactory.<init>(PythonWorkerFactory.scala:70)
    ```
    
    `/media/ebs2/yarn` is owned by user `yarn` and group `hadoop`.
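    
    In hindsight, a quick writability check on the YARN container's working directory would have caught the permission mistake before conda ran. A hypothetical check, using the usercache path from the stack trace above:
    
    ```
    import os

    # Application cache directory from the stack trace above.
    workdir = "/media/ebs2/yarn/local/usercache/tsp/appcache/application_1478497303110_0005"
    print os.access(workdir, os.W_OK)  # False here would explain the conda failure
    ```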
    


