spark-issues mailing list archives

From Michael Schmeißer (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
Date Thu, 20 Apr 2017 15:35:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976902#comment-15976902 ]

Michael Schmeißer edited comment on SPARK-650 at 4/20/17 3:34 PM:
------------------------------------------------------------------

In a nutshell, we have our own class {{MySerializer}}, which extends {{org.apache.spark.serializer.JavaSerializer}}
and performs our custom initialization in {{MySerializer#newInstance}} before calling the
super method {{org.apache.spark.serializer.JavaSerializer#newInstance}}. Then, when building
the SparkConf used to initialize the SparkContext, we call {{pSparkConf.set("spark.closure.serializer",
MySerializer.class.getCanonicalName());}}. A sketch follows below.
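
A minimal sketch of this pattern, not our actual code: the {{MyExecutorSetup.initializeOnce()}} helper is hypothetical and stands in for whatever should run once per serializer instance on the executor.

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.JavaSerializer;
import org.apache.spark.serializer.SerializerInstance;

public class MySerializer extends JavaSerializer {

    // Spark instantiates the configured serializer with the SparkConf.
    public MySerializer(SparkConf conf) {
        super(conf);
    }

    @Override
    public SerializerInstance newInstance() {
        // Run the custom initialization before delegating to the regular Java serializer.
        MyExecutorSetup.initializeOnce(); // hypothetical helper
        return super.newInstance();
    }
}
{code}

And on the driver, when building the SparkConf:

{code:java}
SparkConf pSparkConf = new SparkConf()
    .set("spark.closure.serializer", MySerializer.class.getCanonicalName());
{code}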

We package this with our application JAR and it works, so I think you have to look at your
classpath configuration, [~mboes]. In our case, the JAR which contains the closure serializer
is listed in the following properties (we use Spark 1.5.0 on YARN in cluster mode):
* driver.extraClassPath
* executor.extraClassPath
* yarn.secondary.jars
* spark.yarn.secondary.jars
* spark.driver.extraClassPath
* spark.executor.extraClassPath

If I recall correctly, the variants without the "spark." prefix are produced by us: we prefix
all of our properties with "spark." to transfer them via Oozie and strip the prefix again
later, so you should only need the properties with the "spark." prefix.
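
For illustration only (paths, JAR names and the exact set of properties are placeholders, not our actual configuration), classpath entries like these are usually supplied at submit time:

{code}
spark-submit \
  --master yarn-cluster \
  --jars /local/path/my-closure-serializer.jar \
  --conf spark.driver.extraClassPath=my-closure-serializer.jar \
  --conf spark.executor.extraClassPath=my-closure-serializer.jar \
  --class com.example.MyApp \
  my-application.jar
{code}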

Regarding [~riteshtijoriwala]'s questions: 1) Please see the related issue SPARK-1107.
2) You can add a TaskCompletionListener with {{org.apache.spark.TaskContext#addTaskCompletionListener(org.apache.spark.util.TaskCompletionListener)}}.
To get the current TaskContext on the executor, just use {{org.apache.spark.TaskContext#get}}.
We also have some functionality to log the progress of a function at fixed intervals (e.g. every
1,000 records); to do this, you can use mapPartitions with a custom iterator, as sketched below.
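
A minimal sketch of both ideas, assuming the Spark 1.x Java API (where {{FlatMapFunction}} returns an {{Iterable}}); the record type, log output and interval are placeholders rather than our actual code:

{code:java}
import java.util.Iterator;

import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.util.TaskCompletionListener;

public final class ProgressLogging {

    public static JavaRDD<String> withProgressLogging(JavaRDD<String> input) {
        return input.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
            @Override
            public Iterable<String> call(final Iterator<String> records) {
                // Runs on the executor: hook into task completion, e.g. for reporting.
                TaskContext.get().addTaskCompletionListener(new TaskCompletionListener() {
                    @Override
                    public void onTaskCompletion(TaskContext context) {
                        System.out.println("Task " + context.taskAttemptId() + " finished");
                    }
                });
                // Wrap the partition iterator to log progress every 1,000 records.
                return new Iterable<String>() {
                    @Override
                    public Iterator<String> iterator() {
                        return new Iterator<String>() {
                            private long count = 0;

                            @Override
                            public boolean hasNext() {
                                return records.hasNext();
                            }

                            @Override
                            public String next() {
                                String record = records.next();
                                if (++count % 1000 == 0) {
                                    System.out.println("Processed " + count + " records");
                                }
                                return record;
                            }

                            @Override
                            public void remove() {
                                throw new UnsupportedOperationException();
                            }
                        };
                    }
                };
            }
        });
    }
}
{code}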


> Add a "setup hook" API for running initialization code on each executor
> -----------------------------------------------------------------------
>
>                 Key: SPARK-650
>                 URL: https://issues.apache.org/jira/browse/SPARK-650
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Matei Zaharia
>            Priority: Minor
>
> Would be useful to configure things like reporting libraries




