spark-user mailing list archives

From sagi <zhpeng...@gmail.com>
Subject Re: advice on maintaining a production spark cluster?
Date Wed, 21 May 2014 09:42:06 GMT
If you see an exception in the worker's log file like the one described in
https://issues.apache.org/jira/browse/SPARK-1886, you are welcome to try the
patch at https://github.com/apache/spark/pull/827
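
To get a rough idea of how often the Master is rejecting a worker, something like the following could be run over the Master log. This is only a minimal sketch: the pattern is copied from the "Got heartbeat from unregistered worker" lines quoted below, and the function name and sample text are made up for illustration, not part of Spark.

```python
import re
from collections import Counter

# Pattern based on the Master log lines quoted in this thread (an assumption
# about the log format; adjust it if your Spark version logs differently).
UNREGISTERED = re.compile(
    r"WARN Master: Got heartbeat from unregistered worker\s+(\S+)"
)


def count_unregistered_heartbeats(log_text):
    """Return a Counter mapping worker ID -> number of rejected heartbeats."""
    # \s+ also matches a newline, so lines wrapped by a mail client still match.
    return Counter(UNREGISTERED.findall(log_text))


if __name__ == "__main__":
    # Hypothetical sample assembled from the lines quoted in this thread.
    sample = (
        "14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker "
        "worker-20140520011737-hdn3.int.meetup.com-50038\n"
        "14/05/20 01:17:37 INFO Master: Removing executor "
        "app-20140520011605-0001/2 because it is FAILED\n"
        "14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker "
        "worker-20140520011737-hdn3.int.meetup.com-50038\n"
    )
    for worker_id, hits in count_unregistered_heartbeats(sample).items():
        print(worker_id, hits)
```

A worker ID that keeps accumulating hits long after the executor failure is probably one that never re-registered with the Master.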




On Wed, May 21, 2014 at 11:21 AM, Josh Marcus <jmarcus@meetup.com> wrote:

> Aaron:
>
> I see this in the Master's logs:
>
> 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same
> address: akka.tcp://sparkWorker@hdn3.int.meetup.com:50038
> 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker
> worker-20140520011737-hdn3.int.meetup.com-50038
>
> One executor that was launched did fail:
> 14/05/20 01:16:05 INFO Master: Launching executor
> app-20140520011605-0001/2 on worker
> worker-20140519155427-hdn3.int.meetup.com-50038
> 14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2
> because it is FAILED
>
> ... but other executors on other machines also failed without permanently
> disassociating.
>
> There are also these messages; I don't know whether they are related:
> 14/05/20 01:17:38 INFO LocalActorRef: Message
> [akka.remote.transport.AssociationHandle$Disassociated] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
> was not delivered. [3] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/05/20 01:17:38 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
> was not delivered. [4] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
>
>
>
>
> On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
>
>> Unfortunately, those errors are actually due to an Executor that exited,
>> such that the connection between the Worker and Executor failed. This is
>> not a fatal issue, unless there are analogous messages from the Worker to
>> the Master (which should be present, if they exist, at around the same
>> point in time).
>>
>> Do you happen to have the logs from the Master that indicate that the
>> Worker terminated? Is it just an Akka disassociation, or some exception?
>>
>>
>> On Tue, May 20, 2014 at 12:53 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> This isn't helpful of me to say, but, I see the same sorts of problem
>>> and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
>>> into when it happens, but usually after heavy use and after running
>>> for a long time. I had figured I'd see if the changes since 0.9.0
>>> addressed it and revisit later.
>>>
>>> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <jmarcus@meetup.com> wrote:
>>> > So, for example, I have two disassociated worker machines at the moment.
>>> > The last messages in the spark logs are akka association error messages,
>>> > like the following:
>>> >
>>> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
>>> > [akka.tcp://sparkWorker@hdn3.int.meetup.com:50038] ->
>>> > [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]: Error [Association
>>> > failed with [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]] [
>>> > akka.remote.EndpointAssociationException: Association failed with
>>> > [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]
>>> > Caused by:
>>> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>>> > Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
>>> > ]
>>> >
>>> > On the master side, there are lots and lots of messages of the form:
>>> >
>>> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
>>> > worker-20140520011737-hdn3.int.meetup.com-50038
>>> >
>>> > --j
>>> >
>>> >
>>>
>>
>>
>


-- 
---------------------------------
Best Regards
