storm-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (STORM-406) Trident topologies getting stuck when using Netty transport (reproducible)
Date Thu, 24 Jul 2014 20:56:41 GMT


ASF GitHub Bot commented on STORM-406:

GitHub user kishorvpatil opened a pull request:

    [STORM-406] Fix for reconnect logic in netty client 

      - Check whether the channel ``isConnected``.
      - If it is not, reconnect before the start of each batch.
      - Increase max retries for the Netty client so that the other worker has enough time to start/restart
and begin accepting new Netty connections.
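The check-then-reconnect idea above can be sketched as follows. This is a minimal illustration, not Storm's actual Client class: ``Channel`` here is a stand-in for Netty 3's channel type (whose real ``isConnected()`` method the fix relies on), and ``NettyClientSketch``, ``connector``, and ``reconnects`` are hypothetical names for this example only.

```java
import java.util.function.Supplier;

// Sketch of the reconnect-before-send logic: before writing each batch,
// verify the channel is still connected and re-establish it if it dropped.
public class NettyClientSketch {
    /** Stand-in for Netty 3's org.jboss.netty.channel.Channel. */
    public interface Channel {
        boolean isConnected();
        void write(Object batch);
    }

    private final Supplier<Channel> connector; // opens a new channel
    private Channel channel;
    public int reconnects = 0; // counter exposed for illustration only

    public NettyClientSketch(Supplier<Channel> connector) {
        this.connector = connector;
        this.channel = connector.get();
    }

    /** Check connectivity before each batch; reconnect if the channel died,
     *  then write the batch on the (possibly fresh) channel. */
    public void sendBatch(Object batch) {
        if (channel == null || !channel.isConnected()) {
            channel = connector.get(); // reconnect before this batch
            reconnects++;
        }
        channel.write(batch);
    }
}
```

The point of doing the check per batch is that a worker killed and restarted on the same host:port leaves the client holding a stale channel; resetting the local channel variable (as the third commit below does) forces the next batch onto a fresh connection instead of writing into a dead one.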

You can merge this pull request into a Git repository by running:

    $ git pull netty-client-fix

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #205
commit e1e6a602e330d71410b1876ca9fb6bfc29761f35
Author: Kishor Patil <>
Date:   2014-07-24T19:34:38Z

    Fix netty client reconnect issue

commit 9b3a632340b66f1fe493220c0e6c22c2912e8025
Author: Kishor Patil <>
Date:   2014-07-24T19:36:41Z

    Increase netty max retries defaults allowing more time for other workers to come up

commit f1f5ecd92c7ea55d01163c8dbc2360466f34fd3a
Author: Kishor Patil <>
Date:   2014-07-24T20:49:22Z

    Increase max tries and reset local channel variable


> Trident topologies getting stuck when using Netty transport (reproducible)
> --------------------------------------------------------------------------
>                 Key: STORM-406
>                 URL:
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating, 0.9.1-incubating,
>         Environment: Linux, OpenJDK 7
>            Reporter: Danijel Schiavuzzi
>            Priority: Critical
>              Labels: b
> When using the new, default Netty transport, Trident topologies sometimes get stuck,
whereas under ZeroMQ everything works fine.
> I can reliably reproduce this issue by killing a Storm worker on a running Trident topology.
If the worker gets re-spawned on the same slot (port), the topology stops processing. But
if the worker re-spawns on a different port, topology processing continues normally.
> The Storm cluster configuration is pretty standard: there are two Supervisor nodes, one
of which also runs Nimbus, UI, and DRPC. I have four slots per Supervisor, and run my test
topology with setNumWorkers set to 8 so that it occupies all eight slots across the cluster.
Killing a worker in this configuration will always re-spawn the worker on the same node and
slot (port), thus causing the topology to stop processing. This is 100% reproducible on a
few Storm clusters of mine, across multiple Storm versions (0.9.1, 0.9.2).
> I have reproduced this with multiple Trident topologies, the simplest of which is the
TridentWordCount topology from storm-starter. I've just modified it slightly to add an additional
Trident filter that logs the tuple throughput.
> Non-transactional Trident topologies just silently stop processing, while transactional
topologies continuously retry the batches: the batches are re-emitted by the spout but never
get processed by the downstream bolts, so they time out.
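The "increase max retries" part of the fix can be motivated with a back-of-the-envelope calculation. Assuming the client retries with a delay that doubles from a minimum wait up to a cap (the exact schedule is an implementation detail of Storm's Netty client, governed by the storm.messaging.netty.max_retries, min_wait_ms, and max_wait_ms settings; ``RetryWindow`` and ``totalWaitMs`` are names invented for this sketch), the total window during which the client keeps trying grows with the retry limit:

```java
// Estimate how long a client keeps retrying under a capped exponential
// backoff: the delay doubles from minWaitMs each attempt, capped at maxWaitMs.
public class RetryWindow {
    public static long totalWaitMs(int maxRetries, long minWaitMs, long maxWaitMs) {
        long total = 0;
        long delay = minWaitMs;
        for (int i = 0; i < maxRetries; i++) {
            total += delay;                       // wait before this retry
            delay = Math.min(delay * 2, maxWaitMs); // double, but cap the delay
        }
        return total;
    }
}
```

Under these assumptions, with a 100 ms minimum and a 1000 ms cap, 10 retries cover about 7.5 seconds while 30 retries cover about 27.5 seconds. If the Supervisor takes longer than the retry window to restart the worker on the same port, a small retry limit expires first and the topology stalls, which matches the behavior described above.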

This message was sent by Atlassian JIRA
