From: "Shuai Zheng" <szheng.code@gmail.com>
To: user@spark.apache.org
Subject: RE: Executor parameter doesn't work for Spark-shell on EMR Yarn
Date: Thu, 15 Jan 2015 16:23:26 -0500

I figured out the second question: if I don't pass the number of partitions for the test data, it defaults to the maximum number of executors (although I don't know what that default maximum is).

val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0,2,30.48,75"), 2)

will trigger only 2 executors. So I think the default executor count is decided by the first RDD operation sent to the executors. That gives me an odd way to control the number of executors: run a fake/test piece of code first to kick off the executors, then run the real workload, since the executors live for the whole lifecycle of the application. Although this may not have any real value in practice :)

But I still need help with my first question. Thanks a lot.
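As an aside, here is a minimal sketch of pinning the partition count explicitly instead of relying on the default (this assumes a live SparkContext `sc` in the same spark-shell session; the data is just my test strings):

```scala
// Pass numSlices explicitly: parallelize into exactly 4 partitions,
// instead of sc.defaultParallelism (which here appears to be one per vCore).
val lines = sc.parallelize(
  List("-240990|161327,9051480,0,2,30.48,75",
       "-240990|161324,9051480,0,2,30.48,75"),
  numSlices = 4)

println(lines.partitions.size)  // 4, regardless of the default
println(sc.defaultParallelism)  // what Spark would have used otherwise
```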
Regards,
Shuai

From: Shuai Zheng [mailto:szheng.code@gmail.com]
Sent: Thursday, January 15, 2015 4:03 PM
To: user@spark.apache.org
Subject: RE: Executor parameter doesn't work for Spark-shell on EMR Yarn

Forgot to mention: I use EMR AMI 3.3.1, Spark 1.2.0, Yarn 2.4. Spark is set up by the standard script: s3://support.elasticmapreduce/spark/install-spark

From: Shuai Zheng [mailto:szheng.code@gmail.com]
Sent: Thursday, January 15, 2015 3:52 PM
To: user@spark.apache.org
Subject: Executor parameter doesn't work for Spark-shell on EMR Yarn

Hi All,

I am testing Spark on an EMR cluster. The environment is a one-node cluster of r3.8xlarge, with 32 vCores and 244G memory.

But the command line I use to start spark-shell doesn't work. For example:

~/spark/bin/spark-shell --jars /home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar --num-executors 6 --executor-memory 10G

Neither --num-executors nor the memory setting takes effect.

More interesting, if I use test code:

val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0,2,30.48,75"))
var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

it will start 32 executors (so I assume it tries to start one executor per vCore).

But if I use some real data (the file size is 200M):

val lines = sc.textFile("s3://.../part-r-00000")
var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

it will start only 4 executors, which maps to the number of HDFS splits (200M gives 4 splits).

So I have two questions:
1. Why is the setup parameter ignored by Yarn? How can I limit the number of executors I run?
2. Why does my much smaller test data set trigger 32 executors while my real 200M data set gets only 4?

So how should I control the executor setup in spark-shell? I also printed the SparkConf; it contains much less than I expect, and I don't see my passed-in parameters there.
scala> sc.getConf.getAll.foreach(println)
(spark.tachyonStore.folderName,spark-af0c4d42-fe4d-40b0-a3cf-25b6a9e16fa0)
(spark.app.id,local-1421353031552)
(spark.eventLog.enabled,true)
(spark.executor.id,driver)
(spark.repl.class.uri,http://10.181.82.38:58415)
(spark.driver.host,ip-10-181-82-38.ec2.internal)
(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70)
(spark.app.name,Spark shell)
(spark.fileserver.uri,http://10.181.82.38:54666)
(spark.jars,file:/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/aws-java-sdk-1.9.14.jar)
(spark.eventLog.dir,hdfs:///spark-logs)
(spark.executor.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar)
(spark.master,local[*])
(spark.driver.port,54191)
(spark.driver.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar)

I searched the old threads; the attached email answers the question about why the vCore setting doesn't work, but I don't think that is the same issue as mine. Otherwise the default Yarn Spark setup couldn't be adjusted at all?

Regards,
Shuai
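One quick check, in case it helps: a sketch of confirming which master the shell actually connected to (this assumes the same spark-shell session; `sc.master` is the standard SparkContext accessor):

```scala
// If this prints local[*] rather than a yarn master, the shell is running
// locally, and YARN-only flags like --num-executors would be silently ignored.
println(sc.master)
```

And indeed my dump above shows spark.master = local[*], so maybe the install script does not configure the shell for YARN by default?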

I figure out the second question, because if I = don’t pass in the num of partition for the test data, it will by = default assume has max executors (although I don’t know what is = this default max num).

 

val lines =3D = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", = "-240990|161324,9051480,0,2,30.48,75"),2)

will only trigger 2 executors.

 

So I think the default = executors num will be decided by the first RDD operation need to send to = executors. This give me a weird way to control the num of executors (a = fake/test code piece run to kick off the executors first, then run the = real behavior – because executor will run the whole lifecycle of = the applications? Although this may not have any real value in practice = J

 

But I still need help = for my first question.

 

Thanks a = lot.

 

Regards,

 

Shuai

 

From:= = Shuai Zheng [mailto:szheng.code@gmail.com]
Sent: Thursday, = January 15, 2015 4:03 PM
To: = user@spark.apache.org
Subject: RE: Executor parameter doesn't = work for Spark-shell on EMR Yarn

 

Forget to mention, I use EMR AMI 3.3.1, Spark = 1.2.0. Yarn 2.4. The spark is setup by the standard script: s3://support.elasticmapreduce/spark/install-spark<= /o:p>

 

 

From:= = Shuai Zheng [mailto:szheng.code@gmail.com] =
Sent: Thursday, January 15, 2015 3:52 PM
To: user@spark.apache.org
Sub= ject: Executor parameter doesn't work for Spark-shell on EMR = Yarn

 

Hi = All,

 

I am testing Spark on EMR cluster. Env is a one node = cluster r3.8xlarge. Has 32 vCore and 244G memory.

 

But the = command line I use to start up spark-shell, it can’t work. For = example:

 

~/spark/bin/spark-shell --jars = /home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar --num-executors 6 = --executor-memory 10G

 

Neither = num-executors nor memory setup works.

 

And more = interesting, if I use test code:

val = lines =3D = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", = "-240990|161324,9051480,0,2,30.48,75"))

var count =3D = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

 

It will = start 32 executors (then I assume it try to start all executors for = every vCore).

 

But if I use some real data to do it (the file size is = 200M):

val lines =3D = sc.textFile("s3://.../part-r-00000")

var count =3D = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

It will only start 4 executors, which map to the = number of HDFS split (200M will have 4 splits).

 

So I have = two questions:

1, Why the setup = parameter is ignored by Yarn? How can I limit the number of executors I = can run?

2, Why my much smaller test = data set will trigger 32 executors but my real 200M data set will only = have 4 executors?

 

So how = should I control the executor setup on the spark-shell? And I print the = sparkConf, it looks like much less than I expect, and I don’t see = my pass in parameter show there.

 

scala> = sc.getConf.getAll.foreach(println)

(spark.tachyonStore.folderName,spark-af0c4d42-fe4d-40b0-a3cf-25= b6a9e16fa0)

(spark.app.id,local-1421353031552)

(spark.eventLog.enabled,true)

(spark.executor.id,driver)

(spark.repl.class.uri,http://10.181.82.38:58415)

(spark.driver.host,ip-10-181-82-38.ec2.internal)

(spark.executor.extraJavaOptions,-verbose:gc = -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC = -XX:CMSInitiatingOccupancyFraction=3D70 = -XX:MaxHeapFreeRatio=3D70)

(spark.app.name,Spark shell)

(spark.fileserver.uri,http://10.181.82.38:54666)

(spark.jars,file:/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib= /aws-java-sdk-1.9.14.jar)

(spark.eventLog.dir,hdfs:///spark-logs)

(spark.executor.extraClassPath,/home/hadoop/spark/classpath/emr= /*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/= lib/*:/home/hadoop/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar= )

(spark.master,local[*])

(spark.driver.port,54191)

(spark.driver.extraClassPath,/home/hadoop/spark/classpath/emr/*= :/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/li= b/*:/home/hadoop/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar)<= o:p>

 

I search the old threads, attached email answer the = question about why vCore setup doesn’t work. But I think this is = not same issue as me. Otherwise then default Yarn Spark setup = can’t do any adjustment?

 

Regards,

 

Shuai

 

 

 

 

------=_NextPart_000_1A77_01D030DF.983FC260--