I'm wondering what metrics, logs, or other indicators I can use to understand why my topology gets stuck after upgrading from Storm 1.1.1 to Storm 2.2.0.
In more detail:
I have a 1.1.1 cluster with 40 workers processing ~400K events/second.
The topology starts by reading from Kinesis via the AWS KCL, which we also use to implement our own backpressure: when the topology is overloaded with tuples, we stop reading from Kinesis until enough progress has been made (i.e., we've been able to checkpoint), and then resume reading.
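To make the mechanism concrete, here is a minimal sketch of the gating pattern I just described. This is not our actual spout code; the class and method names are illustrative, and in reality the counters are driven by KCL record-processor callbacks rather than called directly:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of our homegrown backpressure gate: reading from
// Kinesis is paused whenever too many emitted tuples have not yet been
// checkpointed, and resumes once checkpointing catches up.
public class BackpressureGate {
    private final long maxInFlight;                           // pause threshold
    private final AtomicLong emitted = new AtomicLong();      // tuples handed to the topology
    private final AtomicLong checkpointed = new AtomicLong(); // tuples covered by a checkpoint

    public BackpressureGate(long maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    /** Checked before each Kinesis read: true means "keep reading". */
    public boolean mayRead() {
        return emitted.get() - checkpointed.get() < maxInFlight;
    }

    public void onEmit() {
        emitted.incrementAndGet();
    }

    public void onCheckpoint(long tuplesCovered) {
        checkpointed.addAndGet(tuplesCovered);
    }
}
```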
With this many workers, though, backpressure rarely kicks in, even at much higher event rates.
We've now created a similar cluster with Storm 2.2.0 and tried deploying our topology there.
However, within a couple of seconds no more Kinesis records get read, and the topology appears to wait forever without processing anything.
I would like to troubleshoot this, but I'm not sure where to collect data from.
My initial suspicion is that the new backpressure mechanism in Storm 2 has kicked in and that I need to configure it to resolve this. However, that is nothing more than a guess, and I don't see how to prove or disprove it without a lot of trial and error.
I've found some documentation about backpressure in the performance tuning chapter of the Storm documentation, but it concentrates on configuration parameters and doesn't explain how to observe what's actually going on in a running topology.
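For reference, these are the kinds of knobs that chapter discusses, which I could start varying blindly; the values below are illustrative, not settings I know to be appropriate:

```yaml
# From the Storm 2.x performance tuning docs (values are guesses, not recommendations)
topology.max.spout.pending: 1000
topology.backpressure.wait.strategy: "org.apache.storm.policy.WaitStrategyProgressive"
topology.bolt.wait.strategy: "org.apache.storm.policy.WaitStrategyProgressive"
```

What I'd really like, before touching any of these, is a way to confirm from metrics or logs that backpressure is in fact what's stalling the topology.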