spark-issues mailing list archives

From "Cheolsoo Park (JIRA)" <>
Subject [jira] [Commented] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address
Date Thu, 02 Apr 2015 17:10:53 GMT


Cheolsoo Park commented on SPARK-6662:

[~srowen], thank you for your comment.
bq. Wouldn't you be able to query for the YARN RM address somewhere and include it in the config?
In a typical cloud deployment, there is usually a shared gateway from which users connect
to the various clusters, and a few Spark configs are shared by all of them. Furthermore,
clusters in the cloud are usually transient, so I'd like to avoid adding any cluster-specific
information to the Spark configs.

My current workaround is grep'ing {{yarn.resourcemanager.hostname}} out of yarn-site.xml in
my custom job-launch script on the gateway and passing it via the {{--conf}} option on every
job launch. The intention was to get rid of this hacky bit in my launch script.
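To make the workaround concrete, here is a minimal sketch of what such a gateway-side launch
script does: read {{yarn.resourcemanager.hostname}} out of yarn-site.xml and build the
{{--conf}} flag for spark-submit. The file path and the 18080 port are assumptions for
illustration, not values from this issue.

```python
# Sketch of the gateway-side workaround: extract the RM hostname from
# yarn-site.xml and construct the --conf flag appended to every job launch.
import xml.etree.ElementTree as ET

def rm_hostname(yarn_site_path):
    """Return the value of yarn.resourcemanager.hostname from yarn-site.xml."""
    root = ET.parse(yarn_site_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "yarn.resourcemanager.hostname":
            return prop.findtext("value")
    raise KeyError("yarn.resourcemanager.hostname not set in " + yarn_site_path)

def history_server_conf(yarn_site_path, port=18080):
    """Build the --conf argument the launch script passes to spark-submit."""
    return "--conf spark.yarn.historyServer.address=%s:%d" % (
        rm_hostname(yarn_site_path), port)
```

This is exactly the "hacky bit" the patch is meant to remove: cluster-specific information
leaks into the launch script instead of staying in the cluster's own Hadoop config.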
bq. I am somewhat concerned about adding a narrow bit of support for one particular substitution,
which in turn is to support a specific assumption in one type of deployment.
Yes, I understand your concern. Even though I have a specific problem to solve at hand, I
filed this JIRA hoping that general variable substitution would be added to the Spark config.
In fact, I made an attempt in that direction but quickly ran into the following problems:
# Adding general vars sub to the Spark conf doesn't solve my problem. Since the Spark config
and the Yarn config are separate entities in Spark, I cannot cross-reference properties from
one to the other.
# Alternatively, I could introduce special logic for {{spark.yarn.historyServer.address}},
assuming the RM and HS are on the same node. Since the Spark AM already knows the RM address,
this is trivial to implement. But it makes an even more specific assumption about the deployment.

It looks to me like implementing general vars sub with cross-referencing would involve quite
a bit of refactoring.

So I compromised. That is, I introduced vars sub only for the {{spark.yarn.}} properties. In
fact, vars sub already works for the {{spark.hadoop.}} properties. If you look at the code, all
the {{spark.hadoop.}} properties are copied over to the Yarn config and read back through it.
As a side effect, they support vars sub. I am just expanding the scope of this *secret*
feature to the {{spark.yarn.}} properties.
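For readers unfamiliar with Hadoop-style substitution, the following is a toy model of the
{{$\{var\}}} expansion semantics that properties gain once they land in a Hadoop
Configuration. This mimics the behavior only; it is not Spark or Hadoop code, and the
property values are illustrative.

```python
# Toy model of Hadoop Configuration-style ${var} expansion. Properties copied
# into the Yarn config (e.g. spark.hadoop.*) get this expansion as a side
# effect when read back; the patch extends the same benefit to spark.yarn.*.
import re

VAR = re.compile(r"\$\{([^}]+)\}")

def resolve(props, key, max_depth=20):
    """Expand ${other.key} references in props[key], recursively."""
    value = props[key]
    for _ in range(max_depth):
        match = VAR.search(value)
        if match is None:
            return value
        # Splice in the referenced property's raw value and keep expanding.
        value = value[:match.start()] + props[match.group(1)] + value[match.end():]
    raise ValueError("too many levels of substitution for " + key)

yarn_conf = {
    "yarn.resourcemanager.hostname": "rm.example.com",
    "spark.yarn.historyServer.address": "${yarn.resourcemanager.hostname}:18080",
}
```

With these semantics, {{resolve(yarn_conf, "spark.yarn.historyServer.address")}} yields
{{rm.example.com:18080}} without any hostname ever appearing in the Spark config.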

For now, I can live with my current workaround. But I wanted to point out that asking users
to pass an explicit hostname and port number just to make use of the HS is not user-friendly.
In fact, I'm not aware of any other property that causes the same pain in YARN mode. For
example, the RM address for {{spark.master}} is picked up dynamically from yarn-site.xml. The
HS address should be handled in a similar manner, IMO.
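Concretely, the wish is that a single static line like the following could live in a shared
spark-defaults.conf on the gateway, with the hostname resolved per-cluster at submit time
(the 18080 port is an assumption for illustration):

```
spark.yarn.historyServer.address    ${yarn.resourcemanager.hostname}:18080
```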

Hope this explains my thought process well enough.

> Allow variable substitution in spark.yarn.historyServer.address
> ---------------------------------------------------------------
>                 Key: SPARK-6662
>                 URL:
>             Project: Spark
>          Issue Type: Wish
>          Components: YARN
>    Affects Versions: 1.3.0
>            Reporter: Cheolsoo Park
>            Priority: Minor
>              Labels: yarn
> In Spark on YARN, an explicit hostname and port number need to be set for "spark.yarn.historyServer.address"
> in SparkConf to make the HISTORY link work. If the history server address is known and static,
> this is usually not a problem.
> But in the cloud, that is usually not true. Particularly in EMR, the history server always
> runs on the same node as the RM. So I could simply set it to {{$\{yarn.resourcemanager.hostname\}:18080}}
> if variable substitution were allowed.
> In fact, Hadoop configuration already implements variable substitution, so if this property
> were read via YarnConf, this would be easily achievable.

This message was sent by Atlassian JIRA
