MIME-Version: 1.0
X-Received: by 10.52.143.6 with SMTP id sa6mr3396616vdb.22.1400665326069; Wed, 21 May 2014 02:42:06 -0700 (PDT)
Received: by 10.58.145.104 with HTTP; Wed, 21 May 2014 02:42:06 -0700 (PDT)
In-Reply-To:
References: <76AEBC5D-ECEF-4244-BCE1-B455CA13A71A@gmail.com>
Date: Wed, 21 May 2014 17:42:06 +0800
Message-ID:
Subject: Re: advice on maintaining a production spark cluster?
From: sagi
To: user@spark.apache.org
Content-Type: multipart/alternative; boundary=047d7b5d576435ac7304f9e5cf10
X-Virus-Checked: Checked by ClamAV on apache.org

--047d7b5d576435ac7304f9e5cf10
Content-Type: text/plain; charset=UTF-8

If you see an exception message like the one described in the JIRA
https://issues.apache.org/jira/browse/SPARK-1886 in the worker's log file,
you are welcome to try https://github.com/apache/spark/pull/827


On Wed, May 21, 2014 at 11:21 AM, Josh Marcus wrote:

> Aaron:
>
> I see this in the Master's logs:
>
> 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkWorker@hdn3.int.meetup.com:50038
> 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038
>
> There was an executor that launched that did fail, such as:
> 14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2 on worker worker-20140519155427-hdn3.int.meetup.com-50038
> 14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2 because it is FAILED
>
> ... but other executors on other machines also failed without permanently
> disassociating.
>
> There are these messages which I don't know if they are related:
> 14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>
>
> On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson wrote:
>
>> Unfortunately, those errors are actually due to an Executor that exited,
>> such that the connection between the Worker and Executor failed. This is
>> not a fatal issue, unless there are analogous messages from the Worker to
>> the Master (which should be present, if they exist, at around the same
>> point in time).
>>
>> Do you happen to have the logs from the Master that indicate that the
>> Worker terminated? Is it just an Akka disassociation, or some exception?
>>
>>
>> On Tue, May 20, 2014 at 12:53 PM, Sean Owen wrote:
>>
>>> This isn't helpful of me to say, but, I see the same sorts of problem
>>> and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
>>> into when it happens, but usually after heavy use and after running
>>> for a long time. I had figured I'd see if the changes since 0.9.0
>>> addressed it and revisit later.
>>>
>>> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus wrote:
>>> > So, for example, I have two disassociated worker machines at the moment.
>>> > The last messages in the spark logs are akka association error messages,
>>> > like the following:
>>> >
>>> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@hdn3.int.meetup.com:50038] -> [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]: Error [Association failed with [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]] [
>>> > akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]
>>> > Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
>>> > ]
>>> >
>>> > On the master side, there are lots and lots of messages of the form:
>>> >
>>> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038
>>> >
>>> > --j
>>> >
>>> >
>>>
>>
>

--
---------------------------------
Best Regards

--047d7b5d576435ac7304f9e5cf10
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

--047d7b5d576435ac7304f9e5cf10--
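
For anyone wanting to run the Master-log check Aaron asks about in the quoted thread, here is a minimal sketch in Python. It only counts the two Master messages quoted above, grouped by worker. The function name summarize_master_log and the SAMPLE list are made up for illustration (the sample lines are copied from this thread), and the regexes assume the log format shown here, which may differ in other Spark versions.

```python
import re
from collections import Counter

# Patterns taken from the Master log lines quoted in this thread.
# Assumption: each log message sits on a single line in the real log file.
UNREGISTERED = re.compile(
    r"WARN Master: Got heartbeat from unregistered worker\s+(\S+)"
)
REREGISTER = re.compile(
    r"INFO Master: Attempted to re-register worker at same address:\s+(\S+)"
)


def summarize_master_log(lines):
    """Count unregistered-heartbeat warnings and re-register attempts.

    `lines` is any iterable of log lines (a list or an open file object).
    Returns two Counters: heartbeats keyed by worker id, and re-register
    attempts keyed by worker address.
    """
    heartbeats = Counter()
    reregisters = Counter()
    for line in lines:
        match = UNREGISTERED.search(line)
        if match:
            heartbeats[match.group(1)] += 1
            continue
        match = REREGISTER.search(line)
        if match:
            reregisters[match.group(1)] += 1
    return heartbeats, reregisters


# Sample input: Master log lines copied from the messages quoted above.
SAMPLE = [
    "14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same "
    "address: akka.tcp://sparkWorker@hdn3.int.meetup.com:50038",
    "14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker "
    "worker-20140520011737-hdn3.int.meetup.com-50038",
    "14/05/20 01:17:37 INFO Master: Removing executor "
    "app-20140520011605-0001/2 because it is FAILED",
    "14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker "
    "worker-20140520011737-hdn3.int.meetup.com-50038",
]

heartbeats, reregisters = summarize_master_log(SAMPLE)
for worker, count in heartbeats.most_common():
    print(f"{worker}: {count} heartbeat(s) while unregistered")
for address, count in reregisters.most_common():
    print(f"{address}: {count} re-register attempt(s)")
```

To use it on a real log, pass an open file object instead of SAMPLE. The location of the Master log depends on your deployment, so no path is given here. A worker that keeps showing up in the heartbeat counter long after its re-register attempt is in the state Josh describes.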