From: Prashant Sharma <scrapcodes@gmail.com>
Date: Wed, 30 Oct 2013 20:39:33 +0530
Subject: Re: executor failures w/ scala 2.10
To: user@spark.incubator.apache.org

Can you apply this patch too and check the logs of the driver and the worker?
diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala
index b6f0ec9..ad0ebf7 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala
@@ -132,7 +132,7 @@ class StandaloneSchedulerBackend(scheduler: ClusterScheduler, actorSystem: Actor
     // Remove a disconnected slave from the cluster
     def removeExecutor(executorId: String, reason: String) {
       if (executorActor.contains(executorId)) {
-        logInfo("Executor " + executorId + " disconnected, so removing it")
+        logInfo("Executor " + executorId + " disconnected, so removing it, reason:" + reason)
         val numCores = freeCores(executorId)
         actorToExecutorId -= executorActor(executorId)
         addressToExecutorId -= executorAddress(executorId)
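
To see the disassociation itself rather than only the dead-letter noise, it can also help to subscribe to Akka's remoting lifecycle events on the driver. This is only a minimal sketch and not part of the patch above: the RemotingEventLogger name and the place where it is started are illustrative assumptions, relying on the Akka 2.2 event-stream API.

import akka.actor.{Actor, ActorLogging, Props}
import akka.remote.{DisassociatedEvent, RemotingLifecycleEvent}

// Subscribes to the ActorSystem's remoting lifecycle events so that a
// disassociation is logged with its remote address and direction, instead
// of only surfacing as dead-letter warnings.
class RemotingEventLogger extends Actor with ActorLogging {
  override def preStart() {
    context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
  }
  def receive = {
    case d: DisassociatedEvent =>
      log.warning("Disassociated from {} (inbound = {})", d.remoteAddress, d.inbound)
    case e: RemotingLifecycleEvent =>
      log.info("Remoting lifecycle event: {}", e)
  }
}

// Started once from a debugging patch, e.g. on the driver's actor system:
// actorSystem.actorOf(Props[RemotingEventLogger], "remoting-event-logger")
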
On Wed, Oct 30, 2013 at 8:18 PM, Imran Rashid wrote:

> I just realized something about the failing stages -- they generally occur
> in steps like this:
>
> rdd.mapPartitions{itr =>
>   val myCounters = initializeSomeDataStructure()
>   itr.foreach{
>     //update myCounters in here
>     ...
>   }
>
>   myCounters.iterator.map{
>     //some other transformation here ...
>   }
> }
>
> that is, as a partition is processed, nothing gets output; we just
> accumulate some values. Only at the end of the partition do we output some
> accumulated values.
>
> These stages don't always fail, and generally they do succeed after the
> executor has died and a new one has started -- so I'm pretty confident it's
> not a problem w/ the code. But maybe we need to add something like a
> periodic heartbeat in this kind of operation?
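
For concreteness, a self-contained version of that accumulate-then-emit pattern might look like the sketch below; the word counters and the final transformation are placeholders, not the actual job.

import org.apache.spark.SparkContext
import scala.collection.mutable

object AccumulateThenEmitExample {
  // Nothing is emitted while the partition's iterator is being consumed;
  // output only starts flowing once the whole partition has been processed.
  def countsPerPartition(sc: SparkContext) = {
    val rdd = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)
    rdd.mapPartitions { itr =>
      val myCounters = mutable.Map.empty[String, Long].withDefaultValue(0L)
      itr.foreach { word =>
        myCounters(word) += 1        // update the counters, emit nothing yet
      }
      myCounters.iterator.map { case (word, count) =>
        word -> count                // some other transformation here
      }
    }
  }
}
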
>
> On Wed, Oct 30, 2013 at 8:56 AM, Imran Rashid wrote:
>
>> I'm gonna try turning on more akka debugging msgs as described at
>> http://akka.io/faq/
>> and
>> http://doc.akka.io/docs/akka/current/scala/testing.html#Tracing_Actor_Invocations
>>
>> unfortunately that will require a patch to spark, but hopefully that will
>> give us more info to go on ...
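
A rough idea of the settings such a patch might feed into Spark's Akka configuration is sketched below, assuming the keys described in the Akka docs linked above; where exactly this gets merged into Spark's ActorSystem setup is the part that needs the patch, and existingSparkAkkaConfig is a placeholder.

import com.typesafe.config.ConfigFactory

object AkkaDebugConf {
  // Debug-oriented Akka settings; all keys are standard Akka 2.2 options.
  val debugConf = ConfigFactory.parseString("""
    akka.loglevel = DEBUG
    akka.log-dead-letters = 100
    akka.log-dead-letters-during-shutdown = on
    akka.actor.debug.lifecycle = on
    akka.actor.debug.autoreceive = on
    akka.remote.log-remote-lifecycle-events = on
    akka.remote.log-received-messages = on
    akka.remote.log-sent-messages = on
  """)
}

// e.g. ActorSystem(name, AkkaDebugConf.debugConf.withFallback(existingSparkAkkaConfig))
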
>>
>> On Wed, Oct 30, 2013 at 8:10 AM, Prashant Sharma wrote:
>>
>>> I have things running (from the scala 2.10 branch) for over 3-4 hours now
>>> without a problem, and my jobs write about the same amount of data as you
>>> suggested. My cluster size is 7 nodes and it is not *congested* for memory.
>>> I am going to leave jobs running all night long. Meanwhile, I would
>>> encourage you to try to make the problem reproducible; that can help a ton
>>> in fixing the issue.
>>>
>>> Thanks for testing and reporting your experience. I still feel there is
>>> something else wrong. About tolerance for network connection timeouts,
>>> setting those properties should work, but I am worried about the
>>> Disassociation event. I will have to check -- if this is indeed a
>>> hard-to-reproduce bug, how do I simulate network delays?
>>>
>>>
>>> On Wed, Oct 30, 2013 at 6:05 PM, Imran Rashid wrote:
>>>
>>>> This is a spark-standalone setup (not mesos), on our own cluster.
>>>>
>>>> At first I thought it must be some temporary network problem too -- but
>>>> the times between receiving task completion events from an executor and
>>>> declaring it failed are really small, so I didn't think that could possibly
>>>> be it. Plus we tried increasing various akka timeouts, but that didn't
>>>> help. Or maybe there are some other spark / akka properties we should be
>>>> setting? It certainly should be resilient to such a temporary network
>>>> issue, if that is the problem.
>>>>
>>>> btw, I think I've noticed this happens most often during
>>>> ShuffleMapTasks. The tasks write out very small amounts of data (64 MB
>>>> total for the entire stage).
>>>>
>>>> thanks
>>>>
>>>> On Wed, Oct 30, 2013 at 6:47 AM, Prashant Sharma wrote:
>>>>
>>>>> Are you using mesos? I admit to not having properly tested things on
>>>>> mesos though.
>>>>>
>>>>>
>>>>> On Wed, Oct 30, 2013 at 11:31 AM, Prashant Sharma <scrapcodes@gmail.com> wrote:
>>>>>
>>>>>> Those log messages are new to Akka 2.2 and are usually seen when a
>>>>>> node is disassociated from another, either by a network failure or even a
>>>>>> clean shutdown. This suggests some network issue to me -- are you running
>>>>>> on EC2? It might be a temporary thing in that case.
>>>>>>
>>>>>> I would like to have more details on the long jobs though -- how long?
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 30, 2013 at 1:29 AM, Imran Rashid wrote:
>>>>>>
>>>>>>> We've been testing out the 2.10 branch of spark, and we're running
>>>>>>> into some issues where akka disconnects from the executors after a while.
>>>>>>> We ran some simple tests first, and all was well, so we started upgrading
>>>>>>> our whole codebase to 2.10. Everything seemed to be working, but then we
>>>>>>> noticed that when we run long jobs, things start failing.
>>>>>>>
>>>>>>> The first suspicious thing is that we get akka warnings about
>>>>>>> undeliverable messages sent to deadLetters:
>>>>>>>
>>>>>>> 2013-10-29 11:03:54,577 [spark-akka.actor.default-dispatcher-17]
>>>>>>> INFO akka.actor.LocalActorRef - Message
>>>>>>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
>>>>>>> Actor[akka://spark/deadLetters] to
>>>>>>> Actor[akka://spark/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark%4010.10.5.81%3A46572-3#656094700]
>>>>>>> was not delivered. [4] dead letters encountered. This logging can be turned
>>>>>>> off or adjusted with configuration settings 'akka.log-dead-letters' and
>>>>>>> 'akka.log-dead-letters-during-shutdown'.
>>>>>>>
>>>>>>> 2013-10-29 11:03:54,579 [spark-akka.actor.default-dispatcher-19]
>>>>>>> INFO akka.actor.LocalActorRef - Message
>>>>>>> [akka.remote.transport.AssociationHandle$Disassociated] from
>>>>>>> Actor[akka://spark/deadLetters] to
>>>>>>> Actor[akka://spark/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark%4010.10.5.81%3A46572-3#656094700]
>>>>>>> was not delivered. [5] dead letters encountered. This logging can be turned
>>>>>>> off or adjusted with configuration settings 'akka.log-dead-letters' and
>>>>>>> 'akka.log-dead-letters-during-shutdown'.
>>>>>>>
>>>>>>> Generally within a few seconds after the first such message, there
>>>>>>> are a bunch more, and then the executor is marked as failed, and a new one
>>>>>>> is started:
>>>>>>>
>>>>>>> 2013-10-29 11:03:58,775 [spark-akka.actor.default-dispatcher-3]
>>>>>>> INFO akka.actor.LocalActorRef - Message
>>>>>>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
>>>>>>> Actor[akka://spark/deadLetters] to
>>>>>>> Actor[akka://spark/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkExecutor%40dhd2.quantifind.com%3A45794-6#-890135716]
>>>>>>> was not delivered. [10] dead letters encountered, no more dead letters will
>>>>>>> be logged. This logging can be turned off or adjusted with configuration
>>>>>>> settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>>>>>>>
>>>>>>> 2013-10-29 11:03:58,778 [spark-akka.actor.default-dispatcher-17]
>>>>>>> INFO org.apache.spark.deploy.client.Client$ClientActor - Executor updated:
>>>>>>> app-20131029110000-0000/1 is now FAILED (Command exited with code 1)
>>>>>>>
>>>>>>> 2013-10-29 11:03:58,784 [spark-akka.actor.default-dispatcher-17]
>>>>>>> INFO org.apache.spark.deploy.client.Client$ClientActor - Executor added:
>>>>>>> app-20131029110000-0000/2 on
>>>>>>> worker-20131029105824-dhd2.quantifind.com-51544
>>>>>>> (dhd2.quantifind.com:51544) with 24 cores
>>>>>>>
>>>>>>> 2013-10-29 11:03:58,784 [spark-akka.actor.default-dispatcher-18]
>>>>>>> ERROR akka.remote.EndpointWriter - AssociationError
>>>>>>> [akka.tcp://spark@ddd0.quantifind.com:43068] ->
>>>>>>> [akka.tcp://sparkExecutor@dhd2.quantifind.com:45794]: Error [Association failed
>>>>>>> with [akka.tcp://sparkExecutor@dhd2.quantifind.com:45794]] [
>>>>>>> akka.remote.EndpointAssociationException: Association failed with
>>>>>>> [akka.tcp://sparkExecutor@dhd2.quantifind.com:45794]
>>>>>>> Caused by:
>>>>>>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>>>>>>> Connection refused: dhd2.quantifind.com/10.10.5.64:45794]
>>>>>>>
>>>>>>> Looking in the logs of the failed executor, there are some similar
>>>>>>> messages about undeliverable messages, but I don't see any reason:
>>>>>>>
>>>>>>> 13/10/29 11:03:52 INFO executor.Executor: Finished task ID 943
>>>>>>>
>>>>>>> 13/10/29 11:03:53 INFO actor.LocalActorRef: Message
>>>>>>> [akka.actor.FSM$Timer] from Actor[akka://sparkExecutor/deadLetters] to
>>>>>>> Actor[akka://sparkExecutor/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark%40ddd0.quantifind.com%3A43068-1#772172548]
>>>>>>> was not delivered. [1] dead letters encountered. This logging can be turned
>>>>>>> off or adjusted with configuration settings 'akka.log-dead-letters' and
>>>>>>> 'akka.log-dead-letters-during-shutdown'.
>>>>>>>
>>>>>>> 13/10/29 11:03:53 INFO actor.LocalActorRef: Message
>>>>>>> [akka.remote.transport.AssociationHandle$Disassociated] from
>>>>>>> Actor[akka://sparkExecutor/deadLetters] to
>>>>>>> Actor[akka://sparkExecutor/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark%40ddd0.quantifind.com%3A43068-1#772172548]
>>>>>>> was not delivered. [2] dead letters encountered. This logging can be turned
>>>>>>> off or adjusted with configuration settings 'akka.log-dead-letters' and
>>>>>>> 'akka.log-dead-letters-during-shutdown'.
>>>>>>>
>>>>>>> 13/10/29 11:03:53 INFO actor.LocalActorRef: Message
>>>>>>> [akka.remote.transport.AssociationHandle$Disassociated] from
>>>>>>> Actor[akka://sparkExecutor/deadLetters] to
>>>>>>> Actor[akka://sparkExecutor/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark%40ddd0.quantifind.com%3A43068-1#772172548]
>>>>>>> was not delivered. [3] dead letters encountered. This logging can be turned
>>>>>>> off or adjusted with configuration settings 'akka.log-dead-letters' and
>>>>>>> 'akka.log-dead-letters-during-shutdown'.
>>>>>>>
>>>>>>> 13/10/29 11:03:53 ERROR executor.StandaloneExecutorBackend: Driver
>>>>>>> terminated or disconnected! Shutting down.
>>>>>>>
>>>>>>> 13/10/29 11:03:53 INFO actor.LocalActorRef: Message
>>>>>>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
>>>>>>> Actor[akka://sparkExecutor/deadLetters] to
>>>>>>> Actor[akka://sparkExecutor/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark%40ddd0.quantifind.com%3A43068-1#772172548]
>>>>>>> was not delivered. [4] dead letters encountered. This logging can be turned
>>>>>>> off or adjusted with configuration settings 'akka.log-dead-letters' and
>>>>>>> 'akka.log-dead-letters-during-shutdown'.
>>>>>>>
>>>>>>> After this happens, spark does launch a new executor successfully,
>>>>>>> and continues the job. Sometimes the job just continues happily and there
>>>>>>> aren't any other problems. However, that executor may have to run a bunch
>>>>>>> of steps to re-compute some cached RDDs -- and during that time, another
>>>>>>> executor may crash similarly, and then we end up in a never-ending loop of
>>>>>>> one executor crashing, then trying to reload data, while the others sit
>>>>>>> around.
>>>>>>>
>>>>>>> I have no idea what is triggering this behavior -- there isn't any
>>>>>>> particular point in the job that it regularly occurs at. Certain steps
>>>>>>> seem more prone to this, but there isn't any step which regularly causes
>>>>>>> the problem. In a long pipeline of steps, though, that loop becomes very
>>>>>>> likely. I don't think it's a timeout issue -- the initial failing executors
>>>>>>> can be actively completing stages just seconds before this failure
>>>>>>> happens. We did try adjusting some of the spark / akka timeouts:
>>>>>>>
>>>>>>>     -Dspark.storage.blockManagerHeartBeatMs=300000
>>>>>>>     -Dspark.akka.frameSize=150
>>>>>>>     -Dspark.akka.timeout=120
>>>>>>>     -Dspark.akka.askTimeout=30
>>>>>>>     -Dspark.akka.logLifecycleEvents=true
>>>>>>>
>>>>>>> but those settings didn't seem to help the problem at all. I figure
>>>>>>> it must be some configuration with the new version of akka that we're
>>>>>>> missing, but we haven't found anything. Any ideas?
>>>>>>>
>>>>>>> our code works fine w/ the 0.8.0 release on scala 2.9.3. The
>>>>>>> failures occur on the tip of the scala-2.10 branch (5429d62d)
>>>>>>>
>>>>>>> thanks,
>>>>>>> Imran
>>>>>>
>>>>>> --
>>>>>> s
>>>>>
>>>>> --
>>>>> s
>>>
>>> --
>>> s

-- 
s
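
For reference, the properties listed in Imran's first message can also be set programmatically in the driver, before the SparkContext (and with it the driver's ActorSystem) is created. This is a minimal 0.8-era sketch with a placeholder master URL, using the values from the thread rather than recommended ones.

import org.apache.spark.SparkContext

object TimeoutSettingsExample {
  def main(args: Array[String]) {
    // Must be set before the SparkContext is constructed, since the
    // ActorSystem reads these properties at startup.
    System.setProperty("spark.storage.blockManagerHeartBeatMs", "300000")
    System.setProperty("spark.akka.frameSize", "150")
    System.setProperty("spark.akka.timeout", "120")
    System.setProperty("spark.akka.askTimeout", "30")
    System.setProperty("spark.akka.logLifecycleEvents", "true")

    val sc = new SparkContext("spark://master:7077", "timeout-settings-example")
    // ... run the job ...
    sc.stop()
  }
}
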