storm-user mailing list archives

From "Mitchell Rathbun (BLOOMBERG/ 731 LEX)" <mrathb...@bloomberg.net>
Subject NotAliveException when running storm list
Date Mon, 08 Apr 2019 23:48:35 GMT
We run the Nimbus, Supervisor, and UI daemons on the same machine as a bunch of our topologies.
We have a start script that runs the following:

    PROCESS_FILTER=$(storm list | egrep -io "topology-prefix${TOPOLOGY_ID}")
    if [[ ! -z "${PROCESS_FILTER}" ]]; then
        echo "Shutting down $TOPOLOGY_NAME in cluster mode"
        # Proper way of killing, in cluster mode
        $STORM_CMD kill $TOPOLOGY_NAME -w 5
        rc=$?
        if [[ $rc -ne 0 ]]; then
            exit ${rc}
        fi
    else
        echo "$TOPOLOGY_NAME is not running in either local-mode or cluster-mode"
    fi

......
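
In hindsight, one weakness in this script is that a failing 'storm list' is indistinguishable from an empty listing: the command substitution just yields an empty PROCESS_FILTER, and the emptiness test then misreads the failure as "topology not running". A minimal hardening sketch (assuming bash; LIST_OUTPUT is a scratch variable for illustration, not part of our real script):

    # Run `storm list` on its own first so its exit status is visible:
    # the status of the assignment is the status of `storm list` itself.
    if ! LIST_OUTPUT=$(storm list); then
        echo "storm list failed; cannot tell whether $TOPOLOGY_NAME is running" >&2
        exit 1
    fi
    # Only then filter the captured output for the topology we care about.
    PROCESS_FILTER=$(printf '%s\n' "$LIST_OUTPUT" | egrep -io "topology-prefix${TOPOLOGY_ID}")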

Running the original script gave us the following stack trace in the Nimbus logs:

2019-04-07 13:57:12,230 ERROR ProcessFunction [pool-14-thread-38] Internal error processing getClusterInfo
org.apache.storm.generated.NotAliveException: null
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_172]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_172]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_172]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_172]
        at clojure.lang.Reflector.invokeConstructor(Reflector.java:180) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:562) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.daemon.nimbus$get_resources_for_topology.invoke(nimbus.clj:918) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10704__10708$fn__10709.invoke(nimbus.clj:1583) ~[storm-core-1.2.1.jar:1.2.1]
        at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
        at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
        at clojure.lang.Cons.next(Cons.java:39) ~[clojure-1.7.0.jar:?]
        at clojure.lang.RT.next(RT.java:674) ~[clojure-1.7.0.jar:?]
        at clojure.core$next__4112.invoke(core.clj:64) ~[clojure-1.7.0.jar:?]
        at clojure.core$dorun.invoke(core.clj:3010) ~[clojure-1.7.0.jar:?]
        at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1564) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10799.getClusterInfo(nimbus.clj:2019) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.security.auth.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:144) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[storm-core-1.2.1.jar:1.2.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]

Followed by:

13:57:12 tsdbdwnp289: WingmanTopology289 is not running in either local-mode or cluster-mode

And in the Supervisor logs:

2019-04-07 13:57:13,559 INFO  BasicContainer [Thread-35] Worker Process d9622e0d-edb6-41e4-9d74-8c1a42f23ad1 exited with code: 20
2019-04-07 13:57:13,603 INFO  BasicContainer [Thread-37] Worker Process 744ab585-b8d8-4bcd-909b-e55a46887e67 exited with code: 20
2019-04-07 13:57:13,668 INFO  BasicContainer [Thread-38] Worker Process 26e2d425-f203-4758-82c1-2d8beaee1b00 exited with code: 20
2019-04-07 13:57:13,696 INFO  BasicContainer [Thread-34] Worker Process 70af57ee-c871-459b-88e1-b1bc2553d832 exited with code: 20
2019-04-07 13:57:13,698 INFO  BasicContainer [Thread-40] Worker Process 5b7bbb1d-1b84-4ee3-bcd5-35175db1b710 exited with code: 20
2019-04-07 13:57:14,218 INFO  BasicContainer [Thread-39] Worker Process f1abdf84-303f-4579-9e5c-ccc16c3f418e exited with code: 20
2019-04-07 13:57:14,244 INFO  BasicContainer [Thread-41] Worker Process 37719483-8f36-4c75-8a81-4e25dc53a23d exited with code: 20

However, none of the worker processes we had an issue with ever exited. So going off
of the above script, I believe the failed shutdown was caused by an exception thrown while
calling 'storm list'. Any idea how this could have happened? Why would 'storm list' cause a
NotAliveException in Nimbus? It seems to be a transient issue, as we were able to shut down
the topology successfully later in the day. This all occurred during a machine turn, so a lot
of topologies were coming down in quick succession.
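
In the meantime, since it looks transient, we will probably wrap the list call in a retry loop before concluding a topology is gone. A rough sketch (the attempt count and sleep are arbitrary placeholders, not values Storm prescribes):

    # Retry `storm list` a few times before trusting an empty result,
    # since a transient Nimbus error otherwise looks like "not running".
    list_with_retry() {
        local attempt out
        for attempt in 1 2 3; do            # arbitrary retry budget
            if out=$(storm list); then
                printf '%s\n' "$out"
                return 0
            fi
            sleep 5                         # arbitrary backoff between attempts
        done
        return 1
    }

    if LIST_OUTPUT=$(list_with_retry); then
        PROCESS_FILTER=$(printf '%s\n' "$LIST_OUTPUT" | egrep -io "topology-prefix${TOPOLOGY_ID}")
    else
        echo "storm list still failing after retries; aborting shutdown" >&2
        exit 1
    fi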

