Mailing-List: contact user-help@spark.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: <20150703033502.AD47173E301@webmail.sinamail.sina.com.cn>
References: <20150703033502.AD47173E301@webmail.sinamail.sina.com.cn>
Date: Sat, 4 Jul 2015 17:11:30 -0700
Message-ID: 
 <CALte62x4NTXjn0jXCv2kxBxkeJwsXGJi+7CDc-aKuYozEoPNfA@mail.gmail.com>
Subject: Re: All master are unreponsive issue
From: Ted Yu <yuzhihong@gmail.com>
To: luohui20001@sina.com
Cc: user <user@spark.apache.org>
Content-Type: multipart/alternative; boundary=001a1138ea3c9075b5051a15a1c2

--001a1138ea3c9075b5051a15a1c2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Currently the number of retries is hardcoded.

You may want to open a JIRA to make the retry count configurable.
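
As a rough illustration of what "configurable" could mean here, below is a minimal sketch, not Spark's actual code. It assumes a plain key/value map standing in for the application's configuration, and the key name `spark.deploy.registration.retries` is hypothetical (Spark does not define it). The default of 3 mirrors the hardcoded REGISTRATION_RETRIES value quoted below.

```scala
// Hypothetical sketch only: the object, key name, and helper are invented
// for illustration and are not part of Spark.
object RegistrationSettings {
  // Default mirrors the hardcoded REGISTRATION_RETRIES = 3 quoted in this thread.
  val DefaultRetries = 3

  // Hypothetical configuration key; an assumption, not a real Spark setting.
  val RetriesKey = "spark.deploy.registration.retries"

  /** Read the retry count from a key/value config, falling back to the
    * default when the key is absent, not an integer, or not positive. */
  def retries(conf: Map[String, String]): Int =
    conf.get(RetriesKey)
      .flatMap(s => scala.util.Try(s.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(DefaultRetries)
}

object Demo extends App {
  // No setting present: falls back to the default.
  println(RegistrationSettings.retries(Map.empty))
  // Explicit override supplied by the user.
  println(RegistrationSettings.retries(Map(RegistrationSettings.RetriesKey -> "10")))
}
```

Falling back to the default on a missing or invalid value keeps existing deployments behaving as they do today, so only users who set the key would see a change.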

Cheers

On Thu, Jul 2, 2015 at 8:35 PM, <luohui20001@sina.com> wrote:

> Hi there,
>
> I checked the source code and found that
> org.apache.spark.deploy.client.AppClient defines these parameters (line
> 52):
>
>   val REGISTRATION_TIMEOUT = 20.seconds
>
>   val REGISTRATION_RETRIES = 3
>
> If I want to increase the number of retries, must I modify this value,
> rebuild the entire Spark project, and then redeploy the Spark cluster with
> my modified version?
>
> Or is there a better way to solve this issue?
>
> Thanks.
>
>
>
>
> --------------------------------
>
> Thanks & Best regards!
> San.Luo
>
> ----- Original Message -----
> From: <luohui20001@sina.com>
> To: "user" <user@spark.apache.org>
> Subject: All master are unreponsive issue
> Date: 2015-07-02 17:31
>
> Hi there:
>
> I ran into the error "Application has been killed. Reason: All masters are
> unresponsive! Giving up." I checked the network I/O and found that it is
> sometimes very high while my app is running. Please refer to the attached
> picture for more info.
>
> I also checked
> http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/connectivity_issues.html
> and set SPARK_LOCAL_IP in spark-env.sh on every node of my Spark cluster.
> However, that did not help solve the problem.
>
> I am not sure whether this parameter is set correctly. My settings look
> like this:
>
> On node1:
>
> export SPARK_LOCAL_IP={node1's IP}
>
> On node2:
>
> export SPARK_LOCAL_IP={node2's IP}
>
> ......
>
>
>
> By the way, I guess that Akka retries 3 times when communicating between
> master and slave. Is it possible to increase the number of Akka retries?
>
>
> And apart from expanding the network bandwidth, is there another way to
> solve this problem?
>
>
> Thanks in advance for any ideas.
>
> --------------------------------
>
> Thanks & Best regards!
> San.Luo
>

--001a1138ea3c9075b5051a15a1c2--