storm-user mailing list archives

From Michael Giroux <>
Subject Re: Topology is stuck after upgrade to Storm 2.2.0 - how can I analyze what's going on?
Date Wed, 18 Nov 2020 14:20:59 GMT
 Thanks for the info.  We/I haven't tapped into the metrics (yet?).  Glad you got your problem resolved.
    On Wednesday, November 18, 2020, 09:14:21 AM EST, Adam Honen <> wrote:
I've managed to resolve this, so it's probably best to share what the issue was in my case. As
mentioned above, we have our own back pressure mechanism. It's all controlled from the spout,
so I figured out (read: guessed) that we were probably hitting Storm's limit for the spout's queue.
After increasing topology.executor.receive.buffer.size further, so that it became larger than our
own limit (50K in this case), the issue was resolved.
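For illustration, a minimal sketch of what such an override might look like in the topology configuration; the value 65536 here is an example, not necessarily the one we used (the setting should generally be a power of two larger than your own in-flight limit):

```yaml
# Illustrative override; the default in Storm 2.x is 32768.
topology.executor.receive.buffer.size: 65536
```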
Now, as for identifying this more easily next time: I see in the code that this configuration
is read in WorkerState.mkReceiveQueueMap and passed to the constructor of JCQueue, where a metrics
object is created. It looks like some really useful metrics are reported there.
So next time I plan on hooking up to these metrics (either via one of the built-in reporters,
or via a new implementation better geared to our needs) and reporting some of them to our
monitoring system. That should make troubleshooting such issues much simpler.
I haven't tested this part yet and it's not documented here:
, but hopefully it should still work.
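For anyone wanting to try the same, a sketch of how a metrics v2 reporter can be wired up in storm.yaml, based on the Storm 2.x metrics documentation; the Graphite host/port here are placeholders:

```yaml
# Illustrative reporter configuration (storm.yaml); verify class names
# and keys against your Storm version's metrics_v2 documentation.
storm.metrics.reporters:
  - class: "org.apache.storm.metrics2.reporters.GraphiteStormReporter"
    daemons:
      - "worker"
    report.period: 60
    report.period.units: "SECONDS"
    graphite.host: "localhost"   # placeholder
    graphite.port: 2003          # placeholder
```

With a reporter like this in place, the per-queue metrics created in JCQueue should show up in the external monitoring system instead of requiring log digging.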

On Tue, Nov 17, 2020 at 4:08 PM Adam Honen <> wrote:

I'm wondering what sort of metrics, logs, or other indications I can use in order to understand
why my topology gets stuck after upgrading from Storm 1.1.1 to Storm 2.2.0.

In more detail:
I have a 1.1.1 cluster with 40 workers processing ~400K events/second. It starts by reading
from Kinesis via the AWS KCL, and this is also used to implement our own backpressure. That
is, when the topology is overloaded with tuples, we stop reading from Kinesis until enough
progress has been made (we've been able to checkpoint). After that, we resume reading.
However, with so many workers we don't really see back pressure being needed, even when dealing
with much larger event rates.
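For readers unfamiliar with this pattern, a minimal, self-contained sketch of spout-side throttling under the assumptions described above (pause reading while too many tuples are in flight, resume once acks catch up); the class and method names are illustrative, not our actual code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-flight counter for a spout-driven back pressure scheme.
// In Storm, nextTuple()/ack()/fail() all run on the single spout executor
// thread, so a plain check-then-increment is sufficient here.
public class InFlightThrottle {
    private final long maxInFlight;
    private final AtomicLong inFlight = new AtomicLong();

    public InFlightThrottle(long maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    /** Called before fetching/emitting; false means "pause reading". */
    public boolean tryAcquire() {
        if (inFlight.get() >= maxInFlight) {
            return false; // overloaded: skip reading on this nextTuple() call
        }
        inFlight.incrementAndGet();
        return true;
    }

    /** Called from ack()/fail() when a tuple leaves the topology. */
    public void release() {
        inFlight.decrementAndGet();
    }

    public long inFlight() {
        return inFlight.get();
    }

    public static void main(String[] args) {
        InFlightThrottle t = new InFlightThrottle(2);
        System.out.println(t.tryAcquire()); // true
        System.out.println(t.tryAcquire()); // true
        System.out.println(t.tryAcquire()); // false, limit reached
        t.release();                        // an ack arrives
        System.out.println(t.tryAcquire()); // true again
    }
}
```

The point of the earlier fix is that a scheme like this only works if the spout's own limit stays below Storm's internal receive queue capacity; otherwise Storm's queue fills up first.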
We've now created a similar cluster with Storm 2.2.0, and I've tried deploying our topology
there. However, within a couple of seconds no more Kinesis records get
read. The topology appears to just wait forever without processing anything.
I would like to troubleshoot this, but I'm not sure where to collect data from. My initial
suspicion was that the new back pressure mechanism introduced in Storm 2 might have kicked
in and that I need to configure it in order to resolve this issue. However, this is nothing
more than a guess, and I'm not sure how I can actually prove or disprove it without lots of
trial and error.
I've found some documentation about backpressure in the performance tuning chapter, but it
concentrates on configuration parameters and doesn't explain how to actually understand
what's going on in a running topology.
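For context, these are the kinds of knobs that chapter covers; the values below are Storm 2.x defaults as I understand them, so treat them as a starting point to verify, not a recommendation:

```yaml
# Back-pressure-related settings from the Storm 2.x performance tuning docs
# (values shown are the shipped defaults; check defaults.yaml for your version).
topology.executor.receive.buffer.size: 32768
topology.backpressure.check.millis: 50
topology.backpressure.wait.strategy: "org.apache.storm.policy.WaitStrategyProgressive"
```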
