spark-issues mailing list archives

From "Niels Becker (JIRA)" <>
Subject [jira] [Commented] (SPARK-7791) Set user for executors in standalone-mode
Date Fri, 31 Jul 2015 20:09:04 GMT


Niels Becker commented on SPARK-7791:

I ran into the same problem saving a DataFrame as Parquet.
Our Environment:
- Ubuntu 14
- Spark 1.4.1 prebuilt for Hadoop 2.6
- GlusterFS 3.7
- Mesos 0.23.0
- Docker 1.7.1

Start _pyspark_ as _sparkuser_ and load some data into a DataFrame {{df}}. Then run {{df.write.format("parquet").save("/data/test/wikipedia_test.parquet")}}.
_/data_ is a GlusterFS volume on each node.
_/data/test_ permissions:
# owner: sparkuser
# group: sparkuser
# flags: -s-

Tomasz described a workaround in [], but it does not work for us.
Interestingly, the {{*.gz.parquet}} files have {noformat}root:sparkuser -rw-r--r--{noformat} permissions,
while the {{*.gz.parquet.crc}} files have {noformat}root:sparkuser -rw-rw-r--{noformat},
which is what both should have.
This suggests that Spark does not use the default file permissions, at least for Parquet files.
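As a reference point for the permission discrepancy above, here is a minimal Python sketch (independent of Spark; the file name is illustrative) of how the process umask normally determines default file permissions:

```python
import os
import stat
import tempfile

# Hypothetical demonstration, not Spark code: permissions of a new file are
# the requested mode filtered through the process umask. A umask of 0o002
# turns a requested rw-rw-rw- (0o666) into rw-rw-r-- (0o664), matching the
# .crc files above; the .parquet files instead show rw-r--r-- (0o644), as if
# a 0o022 umask or an explicit mode had been applied.
old_umask = os.umask(0o002)
path = os.path.join(tempfile.mkdtemp(), "demo.crc")
try:
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)  # request rw-rw-rw-
    os.close(fd)
    mode = stat.S_IMODE(os.stat(path).st_mode)
finally:
    os.umask(old_umask)
    os.remove(path)
```

If Spark passed no explicit mode, both file types would come out identical under the same umask, which is why the mismatch points at Spark setting permissions itself.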

I can confirm that setting {{SPARK_USER}} to either {{root}} or {{sparkuser}} has no effect.
Running pyspark as root works.

I assume that all Spark tasks are executed as root and override the default file permissions,
but do not change the owner.
So after the job is done, the driver tries to rename the files to their final destination but
fails due to lack of permissions.
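The failing commit step can be sketched as follows. This is a simplified stand-in for Hadoop's {{FileOutputCommitter.mergePaths}}, not the actual implementation; the directory names only mimic the layout from the stack trace below:

```python
import os
import tempfile

# The committer moves task output out of _temporary into the final directory.
# A POSIX rename() needs write permission on both parent directories; the
# ownership of the file itself does not matter. So when executors running as
# root create _temporary (owned by root) inside a sparkuser-owned output
# directory, the driver running as sparkuser cannot complete this step.
base = tempfile.mkdtemp()                       # stands in for the output dir
tmp_dir = os.path.join(base, "_temporary", "0", "task_0")
os.makedirs(tmp_dir)
src = os.path.join(tmp_dir, "part-r-00002.parquet")
open(src, "w").close()

dst = os.path.join(base, "part-r-00002.parquet")
os.rename(src, dst)  # succeeds here because we own every directory involved
```

Run as a mismatched user, the {{os.rename}} call would raise {{PermissionError}}, which corresponds to the "Failed to rename" exception quoted below.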

> Set user for executors in standalone-mode
> -----------------------------------------
>                 Key: SPARK-7791
>                 URL:
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core
>            Reporter: Tomasz Früboes
> I'm opening this following a discussion in
>  Our setup was as follows. Spark (1.3.1, prebuilt for Hadoop 2.6, also 2.4) was installed
in standalone mode and started manually from the root account. Everything worked properly
apart from operations such as
> rdd.saveAsPickleFile(ofile)
> which end with exception:
> py4j.protocol.Py4JJavaError: An error occurred while calling
> : Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_000001/part-r-00002.parquet;
isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000;
access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-00002.parquet
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(
> (files created in _temporary were owned by user root). It would be great if Spark could
set the user for the executor in standalone mode as well. Setting SPARK_USER has no effect here.
> BTW it may be a good idea to add a warning (e.g. during Spark startup) that running
from the root account is not a very healthy idea. E.g. mapping this function
> def test(x):
>    f = open('/etc/testTMF.txt', 'w')
>    return 0
> on an RDD creates a file in /etc/ (surprisingly, calls like f.write("text") end with an
> Thanks,
>   Tomasz

This message was sent by Atlassian JIRA
