From: "Simon Cooper (JIRA)"
To: dev@storm.incubator.apache.org
Date: Wed, 9 Jul 2014 13:27:04 +0000 (UTC)
Subject: [jira] [Created] (STORM-397) Nimbus does not reassign a topology when the supervisor dies

Simon Cooper created STORM-397:
----------------------------------

Summary: Nimbus does not reassign a topology when the supervisor dies
Key: STORM-397
URL: https://issues.apache.org/jira/browse/STORM-397
Project: Apache Storm (Incubating)
Issue Type: Bug
Affects Versions: 0.9.2-incubating
Environment: 2 topologies, 3 supervisors
Reporter: Simon Cooper
Priority: Critical

We're running two topologies on a cluster with 3 supervisors. By default, both topologies are assigned onto the same supervisor. If that supervisor dies, storm reassigns one topology to another supervisor but not the other, leaving the second topology inactive.

There are various symptoms/possible causes of this problem. In the nimbus logs, from when the topologies are initially submitted, nimbus is continually trying to reassign the second topology to the same supervisor every 10 seconds:

...
2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509: #backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs {[1 2] 1404911831, [7 8] 1404911831, [3 4] 1404911831, [12 12] 1404911831, [9 10] 1404911831, [5 5] 1404911831, [11 11] 1404911831, [6 6] 1404911831}}
2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509: #backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10]
["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs {[1 2] 1404911841, [7 8] 1404911841, [3 4] 1404911841, [12 12] 1404911841, [9 10] 1404911841, [5 5] 1404911841, [11 11] 1404911841, [6 6] 1404911841}}
2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509: #backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs {[1 2] 1404911852, [7 8] 1404911852, [3 4] 1404911852, [12 12] 1404911852, [9 10] 1404911852, [5 5] 1404911852, [11 11] 1404911852, [6 6] 1404911852}}
...

These log messages continue after the supervisor it's running on dies - nimbus continually tries to reassign to a dead supervisor. Note that the other topology is reassigned elsewhere without problems. If the broken topology is rebalanced, only then does nimbus assign the topology to a working supervisor.

Another symptom of this is that, when the machines running storm are started, only one topology is running on startup. The second topology is not assigned to a supervisor.
Again, it takes a rebalance for nimbus to actually assign the topology somewhere.

A couple of possibly related bugs are STORM-256 and STORM-341, but I don't understand those bugs well enough to be able to link them to these problems.

This is a major issue for us. One of the reasons for using storm is that if a supervisor were to die, storm would automatically fail over to another supervisor. This does not happen, leaving our cluster with a single point of failure (SPOF).
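
For anyone trying to confirm the same symptom on their own cluster, the repeated assignments can be picked out of the nimbus log mechanically. The following is only a rough sketch: the function name `repeated_assignments` and the `min_repeats` threshold are made up for illustration, and the regular expressions assume the log line format shown in the report above (0.9.2-incubating), which may differ in other versions. It flags topologies whose consecutive "Setting new assignment" lines keep reusing exactly the same set of node/port slots; it does not check whether that node is actually alive.

```python
import re
from collections import defaultdict

# Assumed format: "<date> <time> ... Setting new assignment for topology id <id>: ..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"Setting new assignment for topology id (\S+):"
)
# Each slot in :executor->node+port is printed as ["<36-char node id>" <port>].
# The :node->host map uses braces, so it is not matched here.
SLOT_RE = re.compile(r'\["([0-9a-f-]{36})" (\d+)\]')


def repeated_assignments(log_lines, min_repeats=3):
    """Return {topology_id: (slots, count)} for topologies whose most recent
    run of consecutive assignments reuses exactly the same (node, port) slots
    at least min_repeats times. Hypothetical helper, not part of Storm."""
    last_slots = {}
    streak = defaultdict(int)
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not an assignment line
        topology_id = match.group(2)
        slots = frozenset(SLOT_RE.findall(line))
        if last_slots.get(topology_id) == slots:
            streak[topology_id] += 1
        else:
            streak[topology_id] = 1  # first sighting, or the slots changed
        last_slots[topology_id] = slots
    return {
        topology_id: (sorted(last_slots[topology_id]), count)
        for topology_id, count in streak.items()
        if count >= min_repeats
    }
```

It could be run over a log file with something like `repeated_assignments(open("nimbus.log"))`, where the file path is an assumption about the local setup. Each flagged topology comes back with its slot list and the length of the current streak, which can then be compared by hand against the supervisors that are still up.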