storm-dev mailing list archives

From "Simon Cooper (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (STORM-397) Nimbus does not reassign a topology when the supervisor dies
Date Wed, 09 Jul 2014 13:33:05 GMT

     [ https://issues.apache.org/jira/browse/STORM-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Cooper updated STORM-397:
-------------------------------

    Description: 
We're running two topologies on a cluster with 3 supervisors. By default, both topologies
are assigned onto the same supervisor. If that supervisor dies, storm reassigns one topology
to another supervisor but not the other, leaving the second topology inactive.

There are various symptoms/possible causes of this problem. In the nimbus logs, from when
the topologies are initially submitted, nimbus is continually trying to reassign the second
topology to the same supervisor every 10 seconds:

{noformat}
2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509:
#backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509",
:node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
{[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703]}, :executor->start-time-secs {[1 2] 1404911831, [7 8] 1404911831, [3 4] 1404911831,
[12 12] 1404911831, [9 10] 1404911831, [5 5] 1404911831, [11 11] 1404911831, [6 6] 1404911831}}
2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509:
#backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509",
:node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
{[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703]}, :executor->start-time-secs {[1 2] 1404911841, [7 8] 1404911841, [3 4] 1404911841,
[12 12] 1404911841, [9 10] 1404911841, [5 5] 1404911841, [11 11] 1404911841, [6 6] 1404911841}}
2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509:
#backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509",
:node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
{[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703]}, :executor->start-time-secs {[1 2] 1404911852, [7 8] 1404911852, [3 4] 1404911852,
[12 12] 1404911852, [9 10] 1404911852, [5 5] 1404911852, [11 11] 1404911852, [6 6] 1404911852}}
{noformat}

These log messages continue after the supervisor that the topology is running on dies: nimbus
keeps trying to reassign the topology to the dead supervisor. Note that the other topology is
reassigned elsewhere without problems.
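
To check a longer nimbus log for the same pattern, something like the following Python sketch may help. It is not part of Storm; the function names and regular expressions are assumptions based only on the log excerpt above. It parses each "Setting new assignment" record and reports records whose executor placement is identical to the previous one for the same topology, ignoring map ordering and start times.

```python
import re
from datetime import datetime

# Start of a nimbus "Setting new assignment" record (format assumed from the excerpt).
HEADER = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"Setting new assignment for topology id (\S+?):"
)
# One executor entry, e.g. [6 6] ["9f5f2ddd-..." 6703]
EXECUTOR = re.compile(r'\[(\d+) (\d+)\] \["([0-9a-f-]+)"\s+(\d+)\]')


def parse_assignments(log_text):
    """Return a list of (timestamp, topology_id, placement) tuples.

    placement is a frozenset of (first_task, last_task, node, port), so two
    records compare equal even when the map entries are printed in a
    different order.
    """
    text = " ".join(log_text.split())  # undo line wrapping
    headers = list(HEADER.finditer(text))
    records = []
    for i, match in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[match.end():end]
        placement = frozenset(
            (int(first), int(last), node, int(port))
            for first, last, node, port in EXECUTOR.findall(body)
        )
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        records.append((stamp, match.group(2), placement))
    return records


def repeated_assignments(records):
    """Yield (topology_id, seconds_since_previous) for every record whose
    placement equals the previous record for the same topology."""
    last_seen = {}
    for stamp, topology_id, placement in records:
        previous = last_seen.get(topology_id)
        if previous is not None and previous[1] == placement:
            yield topology_id, (stamp - previous[0]).total_seconds()
        last_seen[topology_id] = (stamp, placement)
```

Calling `list(repeated_assignments(parse_assignments(open(path).read())))` on a nimbus log (with `path` pointing at wherever that log lives) should list each repeat together with the gap in seconds since the previous identical assignment.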

If the broken topology is rebalanced, only then does nimbus assign the topology to a working
supervisor.
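
For reference, the rebalance workaround can be scripted. This is a minimal Python sketch that assumes the `storm` client is on the PATH and can reach nimbus; the helper names and the topology name `my-topology` are hypothetical. Note that `storm rebalance` takes the topology name, not the topology id shown in the logs.

```python
import subprocess


def rebalance_command(topology_name, wait_secs=None):
    """Build the argv for `storm rebalance`, optionally with a -w wait time."""
    cmd = ["storm", "rebalance", topology_name]
    if wait_secs is not None:
        cmd += ["-w", str(wait_secs)]
    return cmd


def rebalance(topology_name, wait_secs=None):
    """Run the rebalance; raises CalledProcessError if the client exits non-zero."""
    return subprocess.run(rebalance_command(topology_name, wait_secs), check=True)
```

For example, `rebalance("my-topology", wait_secs=0)` would run `storm rebalance my-topology -w 0`.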

Another symptom of this is that, when the machines running storm are started, only one topology
is running on startup. The second topology is not assigned to a supervisor. Again, it takes
a rebalance for nimbus to actually assign the topology somewhere.

A couple of possibly related bugs are STORM-256 and STORM-341, but I don't understand those
bugs well enough to be able to link them to these problems.

This is a major issue for us. One of the reasons for using storm is that if a supervisor were
to die, storm would automatically fail over to another supervisor. This does not happen, leaving
our cluster with a SPOF.

  was:
We're running two topologies on a cluster with 3 supervisors. By default, both topologies
are assigned onto the same supervisor. If that supervisor dies, storm reassigns one topology
to another supervisor but not the other, leaving the second topology inactive.

There are various symptoms/possible causes of this problem. In the nimbus logs, from when
the topologies are initially submitted, nimbus is continually trying to reassign the second
topology to the same supervisor every 10 seconds:

{code}
2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509:
#backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509",
:node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
{[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703]}, :executor->start-time-secs {[1 2] 1404911831, [7 8] 1404911831, [3 4] 1404911831,
[12 12] 1404911831, [9 10] 1404911831, [5 5] 1404911831, [11 11] 1404911831, [6 6] 1404911831}}
2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509:
#backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509",
:node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
{[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703]}, :executor->start-time-secs {[1 2] 1404911841, [7 8] 1404911841, [3 4] 1404911841,
[12 12] 1404911841, [9 10] 1404911841, [5 5] 1404911841, [11 11] 1404911841, [6 6] 1404911841}}
2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509:
#backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509",
:node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
{[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330"
6703]}, :executor->start-time-secs {[1 2] 1404911852, [7 8] 1404911852, [3 4] 1404911852,
[12 12] 1404911852, [9 10] 1404911852, [5 5] 1404911852, [11 11] 1404911852, [6 6] 1404911852}}
{code}

These log messages continue after the supervisor it's running on dies - nimbus continually
tries to reassign to a dead supervisor. Note that the other topology is reassigned elsewhere
without problems.

If the broken topology is rebalanced, only then does nimbus assign the topology to a working
supervisor.

Another symptom of this is that, when the machines running storm are started, only one topology
is running on startup. The second topology is not assigned to a supervisor. Again, it takes
a rebalance for nimbus to actually assign the topology somewhere.

A couple of possibly related bugs are STORM-256 and STORM-341, but I don't really understand
those bugs enough to be able to link it to these problems.

This is a major issue for us. One of the reasons for using storm is that if a supervisor were
to die, storm would automatically fail over to another supervisor. This does not happen, leaving
our cluster with a SPOF.


> Nimbus does not reassign a topology when the supervisor dies
> ------------------------------------------------------------
>
>                 Key: STORM-397
>                 URL: https://issues.apache.org/jira/browse/STORM-397
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>         Environment: 2 topologies, 3 supervisors
>            Reporter: Simon Cooper
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
