pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
Date Wed, 01 Jun 2016 20:13:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311028#comment-15311028 ]

Rohini Palaniswamy commented on PIG-4903:
-----------------------------------------

Please refer to my comment on PIG-4893:

https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311013#comment-15311013

You should avoid shipping the Spark jars altogether, similar to MapReduce and Tez. Please take
the HDFS location of the Spark jar(s) as input from the user. This will be an extra setup step
that the user has to do, but that should be fine, as we already require the same thing for Tez
and MapReduce. For MapReduce, the tarball approach is required for smooth rolling upgrades. The
older method of using the MapReduce installation on the NodeManager also works, and most still
do that, but the recommended approach with YARN is to use HDFS tarballs for the application
master dependencies.
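
For illustration, a minimal sketch of the one-time setup this implies (the HDFS path and jar
version below are placeholders; spark.yarn.jar is an existing Spark 1.x property, analogous to
tez.lib.uris for Tez and mapreduce.application.framework.path for MapReduce; how Pig would
surface it to the user is left open):

{code}
# One-time setup by the user/admin (illustrative HDFS path):
hadoop fs -mkdir -p /apps/spark
hadoop fs -put "$SPARK_HOME"/lib/spark-assembly-*.jar /apps/spark/

# Pig would then take the HDFS location as user input and hand it to Spark,
# e.g. via the existing Spark 1.x property (jar name is illustrative):
#   spark.yarn.jar=hdfs:///apps/spark/spark-assembly-1.6.1-hadoop2.6.0.jar
{code}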

> Avoid add all spark dependency jars to  SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-4903
>                 URL: https://issues.apache.org/jira/browse/PIG-4903
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>         Attachments: PIG-4903.patch
>
>
> There are some comments about bin/pig on https://reviews.apache.org/r/45667/#comment198955.
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However, this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to the classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is that spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all the Spark dependency jars, such as spark-network-shuffle_2.10-1.6.1.jar,
> to the distributed cache (SPARK_YARN_DIST_FILES) and then add them to the classpath of the
> executors (SPARK_DIST_CLASSPATH). Actually, we need not copy all these dependency jars to
> SPARK_DIST_CLASSPATH, because they are all included in spark-assembly.jar, and
> spark-assembly.jar is uploaded with the Spark job.
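
A minimal sketch of the simplification described above (variable names follow the snippet; the
assumption is a Spark 1.x assembly jar under $PIG_HOME/lib):

{code}
# Keep every jar on Pig's local classpath, but ship nothing per-jar:
for f in "$PIG_HOME"/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:$f
done
# No per-jar SPARK_YARN_DIST_FILES / SPARK_DIST_CLASSPATH entries are needed:
# the dependencies are already inside spark-assembly.jar, which the Spark on
# YARN client uploads with the job.
{code}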



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
