From: Ted Yu
To: luohui20001@sina.com
Cc: user
Date: Sat, 4 Jul 2015 17:11:30 -0700
Subject: Re: All master are unreponsive issue

Currently the number of retries is hardcoded.

You may want to open a JIRA asking for the retry count to be made configurable.
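
To illustrate what "configurable" could look like, here is a minimal sketch, not Spark's actual code. It keeps the two hardcoded values quoted from AppClient as defaults and lets a caller override them from a property map. The object name RegistrationSettings and both property keys are hypothetical, invented for this example; they are not real Spark settings.

```scala
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical sketch, not Spark's actual code: the object name and the
// property keys below are made up for illustration only.
object RegistrationSettings {
  // Defaults mirror the hardcoded values quoted from AppClient.
  val DefaultTimeout: FiniteDuration = 20.seconds
  val DefaultRetries: Int = 3

  // Hypothetical property keys (assumptions, not real Spark settings).
  val RetriesKey = "example.registration.retries"
  val TimeoutKey = "example.registration.timeoutSeconds"

  // Returns a strictly positive integer for the key, or None if the value
  // is missing, malformed, or not positive.
  private def positiveInt(props: Map[String, String], key: String): Option[Int] =
    props.get(key).flatMap(s => Try(s.trim.toInt).toOption).filter(_ > 0)

  // Retry count: the override if it is valid, otherwise the default of 3.
  def retries(props: Map[String, String]): Int =
    positiveInt(props, RetriesKey).getOrElse(DefaultRetries)

  // Timeout: the override in seconds if valid, otherwise 20 seconds.
  def timeout(props: Map[String, String]): FiniteDuration =
    positiveInt(props, TimeoutKey).map(_.seconds).getOrElse(DefaultTimeout)
}
```

With something along these lines, a value such as Map("example.registration.retries" -> "10") would raise the retry count without a rebuild, while a missing or invalid value falls back to the current defaults.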

Cheers
On Thu, Jul 2, 2015 at 8:35 PM, <luohui20001@sina.com> wrote:

Hi there,

I checked the source code and found that org.apache.spark.deploy.client.AppClient defines the following parameters (line 52):

  val REGISTRATION_TIMEOUT = 20.seconds

  val REGISTRATION_RETRIES = 3

As far as I know, if I want to increase the number of retries, I must modify this value, rebuild the entire Spark project, and then redeploy the Spark cluster with my modified version. Is that right?

Or is there a better way to solve this issue?

Thanks.




--------------------------------
Thanks & Best regards!
San.Luo

----- Original Message -----
From: <luohui20001@sina.com>
To: "user" <user@spark.apache.org>
Subject: All master are unreponsive issue
Date: July 2, 2015, 17:31

Hi there:

I ran into a problem: "Application has been killed. Reason: All masters are unresponsive! Giving up." I checked the network I/O and found that it is sometimes very high while my app is running. Please refer to the attached picture for more info.

I also checked http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/connectivity_issues.html and set SPARK_LOCAL_IP in every node's spark-env.sh on my Spark cluster. However, this did not help solve the problem.

I am not sure whether this parameter is set correctly. My settings look like this:

On node1:

export SPARK_LOCAL_IP={node1's IP}

On node2:

export SPARK_LOCAL_IP={node2's IP}

......
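
One way to sanity-check a per-node setting like the one above is to confirm that the configured address is actually assigned to a network interface on that node. The following is a minimal sketch under that assumption; the object name LocalIpCheck is hypothetical, and the check only uses the standard java.net API. It does not prove that the master can reach the node, only that the address belongs to the machine it runs on.

```scala
import java.net.{InetAddress, NetworkInterface}
import scala.util.Try

// Hypothetical helper for illustration; not part of Spark.
object LocalIpCheck {
  // True if `ip` resolves to an address that is assigned to a network
  // interface on this machine; false if it is not, or if it cannot be
  // resolved at all (for example a leftover placeholder value).
  def isLocalAddress(ip: String): Boolean =
    Try {
      val addr = InetAddress.getByName(ip)
      NetworkInterface.getByInetAddress(addr) != null
    }.getOrElse(false)

  // Reads SPARK_LOCAL_IP from the environment and reports the result.
  def main(args: Array[String]): Unit =
    sys.env.get("SPARK_LOCAL_IP") match {
      case Some(ip) if isLocalAddress(ip) =>
        println(s"SPARK_LOCAL_IP=$ip is assigned to a local interface")
      case Some(ip) =>
        println(s"SPARK_LOCAL_IP=$ip is NOT assigned to any local interface")
      case None =>
        println("SPARK_LOCAL_IP is not set in this environment")
    }
}
```

Running this on each node, after sourcing that node's spark-env.sh, would show whether the exported value matches an address the node really owns.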



BTW, I guess that Akka retries 3 times when communicating between master and slave. Is it possible to increase the number of Akka retries?


And apart from expanding the network bandwidth, is there another way to solve this problem?


Thanks in advance for any ideas.


--------------------------------
Thanks & Best regards!
San.Luo
