spark-user mailing list archives

From Dominik Fries <dominik.fr...@woodmark.de>
Subject Re: spark multi tenancy
Date Wed, 07 Oct 2015 10:15:50 GMT
We are currently trying to run pyspark from a personal user's CLI in the context of the project user, but we get the following error (the cluster is Kerberized):

[<user>@edgenode1 ~]$ pyspark --master yarn --num-executors 5 --proxy-user <project-user>
Python 2.7.5 (default, Jun 24 2015, 00:41:19) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
15/10/06 09:44:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
15/10/06 09:44:25 INFO SparkContext: Running Spark version 1.3.1
15/10/06 09:44:25 INFO SecurityManager: Changing view acls to: <user>,<project-user>
15/10/06 09:44:25 INFO SecurityManager: Changing modify acls to: <user>,<project-user>
15/10/06 09:44:25 INFO SecurityManager: SecurityManager: authentication disabled; ui acls
disabled; users with view permissions: Set(<user>, <project-user>); users with
modify permissions: Set(<user>, <project-user>)
15/10/06 09:44:25 INFO Slf4jLogger: Slf4jLogger started
15/10/06 09:44:25 INFO Remoting: Starting remoting
15/10/06 09:44:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@<server>:40607]
15/10/06 09:44:26 INFO Utils: Successfully started service 'sparkDriver' on port 40607.
15/10/06 09:44:26 INFO SparkEnv: Registering MapOutputTracker
15/10/06 09:44:26 INFO SparkEnv: Registering BlockManagerMaster
15/10/06 09:44:26 INFO DiskBlockManager: Created local directory at /tmp/spark-10b70025-ca98-4940-91b8-6dbd0b7148aa/blockmgr-33e9fb6d-d5b2-4fa5-876f-0b91501be632
15/10/06 09:44:26 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/10/06 09:44:26 INFO HttpFileServer: HTTP File server directory is /tmp/spark-1a4b86f0-3e57-4f44-bded-6157f4f1933f/httpd-2cafcce9-71ec-44fb-8500-2c70756ea3b9
15/10/06 09:44:26 INFO HttpServer: Starting HTTP Server
15/10/06 09:44:26 INFO Server: jetty-8.y.z-SNAPSHOT
15/10/06 09:44:26 INFO AbstractConnector: Started SocketConnector@0.0.0.0:34903
15/10/06 09:44:26 INFO Utils: Successfully started service 'HTTP file server' on port 34903.
15/10/06 09:44:26 INFO SparkEnv: Registering OutputCommitCoordinator
15/10/06 09:44:26 INFO Server: jetty-8.y.z-SNAPSHOT
15/10/06 09:44:26 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/10/06 09:44:26 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/10/06 09:44:26 INFO SparkUI: Started SparkUI at http://<server>:4040
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
15/10/06 09:44:27 INFO TimelineClientImpl: Timeline service address: http://<master-node>:8188/ws/v1/timeline/
15/10/06 09:44:27 INFO RMProxy: Connecting to ResourceManager at <master-node>/10.49.20.5:8050
Traceback (most recent call last):
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/shell.py", line 50, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 110, in __init__
    conf, jsc, profiler_cls)
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 158, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 211, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/home/<user>/.local/lib/python2.7/site-packages/py4j-0.9-py2.7.egg/py4j/java_gateway.py",
line 1064, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/home/<user>/.local/lib/python2.7/site-packages/py4j-0.9-py2.7.egg/py4j/protocol.py",
line 308, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
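For context: a Py4JJavaError at JavaSparkContext construction when using --proxy-user on a Kerberized cluster is often caused by missing Hadoop impersonation settings on the cluster. A minimal sketch of the core-site.xml entries that would be involved, assuming the submitting user is named "edgeuser" (a placeholder, not a name from this thread):

```xml
<!-- core-site.xml sketch: allow the submitting user "edgeuser"
     (placeholder name) to impersonate members of the "project" group
     when connecting from the edge node. The hadoop.proxyuser.* keys
     are the standard Hadoop impersonation properties; the host and
     group values here are illustrative only. -->
<property>
  <name>hadoop.proxyuser.edgeuser.hosts</name>
  <value>edgenode1</value>
</property>
<property>
  <name>hadoop.proxyuser.edgeuser.groups</name>
  <value>project</value>
</property>
```

The NameNode and ResourceManager need to pick up these settings before impersonation requests are accepted.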

@guha, yes, you can separate workloads via the YARN Capacity Scheduler.
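As a sketch, queue-based separation might look like the following capacity-scheduler.xml fragment; the queue names and capacities are illustrative, not taken from this thread:

```xml
<!-- capacity-scheduler.xml sketch: two sibling queues under root,
     splitting cluster capacity 70/30. The names "prod" and "dev"
     are placeholders. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
```

A job can then be directed at a specific queue with the --queue flag on spark-submit or pyspark.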

From: ayan guha [mailto:guha.ayan@gmail.com]
Sent: Wednesday, 7 October 2015 12:06
To: Steve Loughran <stevel@hortonworks.com>
Cc: user <user@spark.apache.org>; Dominik Fries <dominik.fries@woodmark.de>
Subject: Re: spark multi tenancy

Can queues also be used to separate workloads?
On 7 Oct 2015 20:34, "Steve Loughran" <stevel@hortonworks.com> wrote:

> On 7 Oct 2015, at 09:26, Dominik Fries <dominik.fries@woodmark.de> wrote:
>
> Hello Folks,
>
> We want to deploy several spark projects and want to use a unique project
> user for each of them. Only the project user should start the spark
> application and have the corresponding packages installed.
>
> Furthermore a personal user, which belongs to a specific project, should
> start a spark application via the corresponding spark project user as proxy.
> (Development)
>
> The Application is currently running with ipython / pyspark. (HDP 2.3 -
> Spark 1.3.1)
>
> Is this possible, or what is the best practice for a Spark multi-tenancy
> environment?
>
>

Deploy on a Kerberized YARN cluster and each application instance will run as a different
Unix user in the cluster, with the appropriate access to HDFS; the applications are isolated from each other.

The issue then becomes "do workloads clash with each other?". If you want to isolate dev &
production, using node labels to keep dev work off the production nodes is the standard technique.
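As a sketch of that technique (label and node names are illustrative): node labels are registered on the ResourceManager, attached to specific nodes, and made accessible to queues, after which work in a queue only runs on matching nodes.

```shell
# Register a "dev" label with the ResourceManager and attach it
# to one node; names and the NodeManager port are placeholders.
yarn rmadmin -addToClusterNodeLabels "dev"
yarn rmadmin -replaceLabelsOnNode "devnode1:45454,dev"
```

The queue-to-label mapping is then done via the yarn.scheduler.capacity.<queue-path>.accessible-node-labels properties in capacity-scheduler.xml.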

