storm-dev mailing list archives

From "Paul Poulosky (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (STORM-406) Trident topologies getting stuck when using Netty transport (reproducible)
Date Thu, 24 Jul 2014 18:50:39 GMT

    [ https://issues.apache.org/jira/browse/STORM-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073515#comment-14073515 ]

Paul Poulosky commented on STORM-406:
-------------------------------------

We were able to reproduce this outside of Trident, using a modified Exclamation topology that
has two workers and parallelism of 1 on the spouts and bolts.

    Word-Spout (worker1)
              |
              V
      Exclaim 1 (worker1)
              | 
              V
      Exclaim 2 (worker2)

If worker2, which contains the downstream bolt, is killed and relaunched, the upstream worker
does not recognize that the connection went down and makes no attempt to reconnect.

We are working on a fix and will submit a patch soon. This should be a blocking issue for
0.9.3.
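For context on what a fix would need to do: Storm's Netty transport exposes retry settings (storm.messaging.netty.min_wait_ms, storm.messaging.netty.max_wait_ms, storm.messaging.netty.max_retries) that are meant to drive an exponential-backoff reconnect when a peer connection drops. The sketch below is only an illustration of a backoff schedule of that shape; the class name, method, and defaults are hypothetical and not Storm's actual implementation.

```java
// Hypothetical sketch of an exponential-backoff reconnect schedule, in the
// spirit of Storm's storm.messaging.netty.{min_wait_ms,max_wait_ms,max_retries}
// settings. Names and defaults are illustrative, not Storm's actual code.
public class ReconnectBackoff {
    private final long minWaitMs;
    private final long maxWaitMs;
    private final int maxRetries;

    public ReconnectBackoff(long minWaitMs, long maxWaitMs, int maxRetries) {
        this.minWaitMs = minWaitMs;
        this.maxWaitMs = maxWaitMs;
        this.maxRetries = maxRetries;
    }

    /**
     * Delay in ms before the given retry attempt (0-based): doubles each
     * attempt, capped at maxWaitMs. Returns -1 once retries are exhausted,
     * at which point the caller should mark the connection dead.
     */
    public long delayMs(int retry) {
        if (retry >= maxRetries) {
            return -1;
        }
        long delay = minWaitMs << retry; // minWaitMs * 2^retry
        return Math.min(delay, maxWaitMs);
    }

    public static void main(String[] args) {
        ReconnectBackoff b = new ReconnectBackoff(100, 1000, 30);
        for (int i = 0; i < 6; i++) {
            System.out.println("retry " + i + " -> " + b.delayMs(i) + " ms");
        }
    }
}
```

The failure described in this comment would correspond to the upstream worker never entering such a retry loop at all after the remote channel closes.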

> Trident topologies getting stuck when using Netty transport (reproducible)
> --------------------------------------------------------------------------
>
>                 Key: STORM-406
>                 URL: https://issues.apache.org/jira/browse/STORM-406
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating, 0.9.1-incubating, 0.9.0.1
>         Environment: Linux, OpenJDK 7
>            Reporter: Danijel Schiavuzzi
>            Priority: Critical
>
> When using the new, default Netty transport, Trident topologies sometimes get stuck,
> while under ZeroMQ everything works fine.
> I can reliably reproduce this issue by killing a Storm worker on a running Trident topology.
> If the worker gets re-spawned on the same slot (port), the topology stops processing. But
> if the worker re-spawns on a different port, topology processing continues normally.
> The Storm cluster configuration is pretty standard: there are two Supervisor nodes, and one
> node also has Nimbus, UI, and DRPC running on it. I have four slots per Supervisor, and run
> my test topology with setNumWorkers set to 8 so that it occupies all eight slots across the
> cluster. Killing a worker in this configuration will always re-spawn the worker on the same
> node and slot (port), thus causing the topology to stop processing. This is 100% reproducible
> on a few Storm clusters of mine, across multiple Storm versions (0.9.0.1, 0.9.1, 0.9.2).
> I have reproduced this with multiple Trident topologies, the simplest of which is the
> TridentWordCount topology from storm-starter. I've just modified it a little to add an additional
> Trident filter to log the tuple throughput: https://github.com/dschiavu/storm-trident-stuck-topology
> Non-transactional Trident topologies just silently stop processing. In transactional
> topologies, the batches are continuously retried and re-emitted by the spout; however, they
> are never processed by the next bolts in the chain, so they time out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
