storm-dev mailing list archives

From "Danijel Schiavuzzi (JIRA)" <>
Subject [jira] [Created] (STORM-406) Trident topologies getting stuck when using Netty transport (reproducible)
Date Thu, 17 Jul 2014 13:22:04 GMT
Danijel Schiavuzzi created STORM-406:

             Summary: Trident topologies getting stuck when using Netty transport (reproducible)
                 Key: STORM-406
             Project: Apache Storm (Incubating)
          Issue Type: Bug
    Affects Versions: 0.9.2-incubating, 0.9.1-incubating,
         Environment: Linux, OpenJDK 7
            Reporter: Danijel Schiavuzzi
            Priority: Critical

When using the new default Netty transport, Trident topologies sometimes get stuck, whereas
under ZeroMQ everything works fine.

I can reliably reproduce this issue by killing a Storm worker on a running Trident topology.
If the worker gets re-spawned on the same slot (port), the topology stops processing. But
if the worker re-spawns on a different port, topology processing continues normally.

The Storm cluster configuration is fairly standard: there are two Supervisor nodes, one of
which also runs Nimbus, UI and DRPC. I have four slots per Supervisor, and run my test
topology with setNumWorkers set to 8 so that it occupies all eight slots across the cluster.
Killing a worker in this configuration will always re-spawn the worker on the same node and
slot (port), thus causing the topology to stop processing. This is 100% reproducible on a
few Storm clusters of mine, across multiple Storm versions (, 0.9.1, 0.9.2).

I have reproduced this with multiple Trident topologies, the simplest of which is the TridentWordCount
topology from storm-starter. I've just modified it a little to add an additional Trident filter
to log the tuple throughput:
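
The filter code itself is not included in this message. As a minimal sketch of the
kind of throughput counting such a filter might do, the helper below counts tuples and
produces a log line once per interval. The class name ThroughputCounter, its method names
and the log format are all hypothetical and are not taken from the reporter's topology. It
has no Storm dependency; the assumption is that a Trident filter (a subclass of
storm.trident.operation.BaseFilter in Storm 0.9.x) would call record() from
isKeep(TridentTuple) and always return true, so that no tuple is dropped.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not the reporter's original code): counts tuples and
 * reports the throughput each time a reporting window closes.
 */
class ThroughputCounter {
    private final long intervalMillis; // minimum length of a reporting window
    private long windowStart;          // timestamp at which the current window began
    private long count;                // tuples seen in the current window

    ThroughputCounter(long intervalMillis, long startMillis) {
        this.intervalMillis = intervalMillis;
        this.windowStart = startMillis;
    }

    /**
     * Records one tuple observed at nowMillis.
     * Returns a log line when the window has lasted at least intervalMillis,
     * otherwise returns null.
     */
    synchronized String record(long nowMillis) {
        count++;
        long elapsed = nowMillis - windowStart;
        if (elapsed < intervalMillis) {
            return null;
        }
        double perSecond = count * 1000.0 / elapsed;
        String line = String.format(Locale.ROOT,
                "%d tuples in %d ms (%.1f tuples/s)", count, elapsed, perSecond);
        // Start a new window at the current timestamp.
        count = 0;
        windowStart = nowMillis;
        return line;
    }
}
```

With a counter like this in the filter, a stuck topology would show up as the log lines
simply ceasing to appear, since record() is only invoked when tuples arrive.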

Non-transactional Trident topologies just silently stop processing, while transactional topologies
continuously retry the batches: the spout re-emits them, but they never get processed
by the next bolts in the chain, so they time out.

This message was sent by Atlassian JIRA