Date: Fri, 8 Jan 2016 01:41:40 +0000 (UTC)
From: "Josh Rosen (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-4991) Worker should reconnect to Master when Master actor restart

    [ https://issues.apache.org/jira/browse/SPARK-4991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088560#comment-15088560 ]

Josh Rosen commented on SPARK-4991:
-----------------------------------

Is this still relevant after we remove the Akka RPC / use of Akka internally in Spark?

> Worker should reconnect to Master when Master actor restart
> -----------------------------------------------------------
>
>                 Key: SPARK-4991
>                 URL: https://issues.apache.org/jira/browse/SPARK-4991
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, Spark Core
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0
>            Reporter: Zhang, Liye
>
> This is a follow-up to [SPARK-4989|https://issues.apache.org/jira/browse/SPARK-4989]. When the Master Akka actor encounters an exception, the Master restarts (an Akka actor restart, not a JVM restart), and all of its previous state is cleared (including workers, applications, etc.). However, the workers are not aware of this at all. The resulting cluster state is that the Master is up and all workers are up, but the Master is not aware that the workers exist and ignores every worker heartbeat because none of the workers are registered. As a result, the whole cluster is unavailable.
> For more information on this area, please refer to [SPARK-3736|https://issues.apache.org/jira/browse/SPARK-3736] and [SPARK-4592|https://issues.apache.org/jira/browse/SPARK-4592].

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
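
The failure mode described in the issue can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Spark's code: the `Master` and `Worker` classes, the `ask_reconnect` flag, and the reply strings are all hypothetical names chosen for illustration. With `ask_reconnect=False` the restarted master silently ignores heartbeats from workers it no longer knows; with `ask_reconnect=True` it asks the worker to register again, which is one possible reading of the improvement the issue title requests.

```python
class Master:
    """Toy stand-in for the standalone Master (hypothetical, not Spark's class)."""

    def __init__(self):
        # worker_id -> number of heartbeats received since registration
        self.workers = {}

    def restart(self):
        # Models an actor restart: in-memory state is lost, the process stays up.
        self.workers.clear()

    def register(self, worker_id):
        self.workers[worker_id] = 0

    def heartbeat(self, worker_id, ask_reconnect):
        if worker_id in self.workers:
            self.workers[worker_id] += 1
            return "ok"
        # Heartbeat from an unregistered worker: either drop it (the reported
        # behaviour) or ask the worker to register again (assumed fix).
        return "reconnect" if ask_reconnect else "ignored"


class Worker:
    """Toy worker that registers once and then only sends heartbeats."""

    def __init__(self, worker_id, master):
        self.worker_id = worker_id
        self.master = master
        master.register(worker_id)

    def send_heartbeat(self, ask_reconnect):
        reply = self.master.heartbeat(self.worker_id, ask_reconnect)
        if reply == "reconnect":
            # Re-register only when the master explicitly asks for it.
            self.master.register(self.worker_id)
        return reply


def simulate(ask_reconnect):
    """Return True if the master knows the worker again after a restart."""
    master = Master()
    worker = Worker("worker-1", master)
    worker.send_heartbeat(ask_reconnect)
    master.restart()
    worker.send_heartbeat(ask_reconnect)
    return "worker-1" in master.workers
```

Calling `simulate(False)` reproduces the stuck state from the description (master up, worker up, worker unknown to the master), while `simulate(True)` shows the worker recovering on its next heartbeat.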