This could happen if you have corrupted input that always causes a batch to fail and be retried. 

However, I have seen this behaviour before without any corrupted input. It might be a bug in Trident; I'm not sure. If you figure it out, please update this thread and/or submit a patch.



On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi <danijel@schiavuzzi.com> wrote:
To (partially) answer my own question: I still have no idea what caused the stuck topology, but re-submitting it helped. After re-submission, the topology is running normally again.


On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <danijel@schiavuzzi.com> wrote:
Also, I did have multiple cases of my IBackingMap workers dying because of RuntimeExceptions, but they restarted successfully afterwards. I throw those RuntimeExceptions in the BackingMap implementation deliberately: it is my strategy for rare SQL database deadlocks, forcing a worker restart so that the batch is failed and retried.
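For readers following along, the fail-and-retry strategy described above might look roughly like the sketch below. It is not the actual trident-mssql code: `BackingMapSketch` is a local stand-in that mirrors the two methods of Trident's `IBackingMap`, and `SqlStore` and `FailFastBackingMap` are hypothetical names. As a simplification, it wraps every `SQLException`, not only deadlocks.

```java
import java.sql.SQLException;
import java.util.List;

// Local stand-in mirroring the two methods of Trident's IBackingMap<T>,
// so the sketch compiles without Storm on the classpath.
interface BackingMapSketch<T> {
    List<T> multiGet(List<List<Object>> keys);
    void multiPut(List<List<Object>> keys, List<T> vals);
}

// Hypothetical abstraction over the SQL reads and writes.
interface SqlStore<T> {
    List<T> read(List<List<Object>> keys) throws SQLException;
    void write(List<List<Object>> keys, List<T> vals) throws SQLException;
}

final class FailFastBackingMap<T> implements BackingMapSketch<T> {
    private final SqlStore<T> store;

    FailFastBackingMap(SqlStore<T> store) {
        this.store = store;
    }

    @Override
    public List<T> multiGet(List<List<Object>> keys) {
        try {
            return store.read(keys);
        } catch (SQLException e) {
            throw new RuntimeException("SQL read failed; failing the batch", e);
        }
    }

    @Override
    public void multiPut(List<List<Object>> keys, List<T> vals) {
        try {
            store.write(keys, vals);
        } catch (SQLException e) {
            // Rethrow unchecked so the error escapes the state update
            // instead of being swallowed; the batch is then not committed.
            throw new RuntimeException("SQL write failed; failing the batch", e);
        }
    }
}
```

The design choice here is to never swallow a database error: an unchecked exception ends the update attempt, and the batch is left for Trident to replay.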

From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:

    2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736

This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a case of a batch retrying indefinitely. This is especially suspicious since the topology has been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.

I can see in the SQL database backing my IBackingMap that the batch with the exact txid value 29698959 has been committed -- but I suspect that could come from another BackingMap, since there are two BackingMap instances running (parallelismHint 2).

However, I have no idea why the batch is being retried indefinitely now nor why it hasn't been successfully acked by Trident.
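As background for the txid observation: with transactional state semantics, each stored value carries the txid that last updated it, and a replayed batch with the same txid is skipped for that key. The sketch below illustrates that rule with an in-memory map. It is an illustration only, not the trident-mssql implementation, and `TransactionalStoreSketch` and its methods are hypothetical names. If that is how the state behaves here, a row carrying txid 29698959 would show only that one partition's write went through, not that the whole batch was acked.

```java
import java.util.HashMap;
import java.util.Map;

final class TransactionalStoreSketch {
    // One stored row: the aggregate plus the txid that last updated it.
    static final class Entry {
        final long txid;
        final long count;

        Entry(long txid, long count) {
            this.txid = txid;
            this.count = count;
        }
    }

    // Hypothetical in-memory stand-in for one partition's SQL table.
    private final Map<String, Entry> rows = new HashMap<>();

    // Adds delta to the key's count under the given txid.
    // Returns false when the row already carries this txid, meaning an
    // earlier attempt of the same batch applied the update.
    boolean apply(String key, long txid, long delta) {
        Entry current = rows.get(key);
        if (current != null && current.txid == txid) {
            return false; // replay of an already applied batch: skip
        }
        long base = (current == null) ? 0L : current.count;
        rows.put(key, new Entry(txid, base + delta));
        return true;
    }

    long count(String key) {
        Entry current = rows.get(key);
        return (current == null) ? 0L : current.count;
    }
}
```

Replaying the same txid against this store leaves the count unchanged, which is why retries are safe for a partition that has already written its part of the batch.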

Any suggestions on which area (topology component) I should focus my investigation on?

Thanks,

On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <danijel@schiavuzzi.com> wrote:
Hello,

I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, but it is now suddenly stuck on a single batch: no tuples are being emitted, and none are being persisted by the TridentState (IBackingMap).

It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of storm-kafka-0.8-plus TransactionalTridentKafkaSpout and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.

In Zookeeper I can see Storm is re-trying a batch, i.e.

     "/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"

... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck and is never acked by Trident. I have no idea why, especially since the topology has been running successfully for the last 20 days.
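One way to watch for this condition is to sample the node twice and compare the results. The sketch below only parses a value of the form shown above and compares two samples; reading the node from ZooKeeper is left out. The class and method names are hypothetical, and the code assumes the value is a flat JSON object, of which it reads only the first entry.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CurrAttempts {
    // Matches one entry of the form "txid":attempts.
    private static final Pattern ENTRY =
            Pattern.compile("\"(\\d+)\"\\s*:\\s*(\\d+)");

    final long txid;
    final long attempts;

    private CurrAttempts(long txid, long attempts) {
        this.txid = txid;
        this.attempts = attempts;
    }

    // Parses the first "txid":attempts entry found in the node value.
    static CurrAttempts parse(String nodeValue) {
        Matcher m = ENTRY.matcher(nodeValue);
        if (!m.find()) {
            throw new IllegalArgumentException("no txid entry in: " + nodeValue);
        }
        return new CurrAttempts(Long.parseLong(m.group(1)),
                                Long.parseLong(m.group(2)));
    }

    // A batch looks stuck when two samples show the same txid
    // with a higher attempt count in the later sample.
    static boolean looksStuck(CurrAttempts earlier, CurrAttempts later) {
        return earlier.txid == later.txid && later.attempts > earlier.attempts;
    }
}
```

For example, parsing `{"29698959":6487}` and, some time later, a value with the same txid and a higher attempt count would make `looksStuck` return true, whereas a healthy topology would show the txid advancing between samples.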

I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.

Any hints on how to debug the stuck topology? Any other useful info I might provide?

Thanks,

--
Danijel Schiavuzzi

E: danijel@schiavuzzi.com
W: www.schiavuzzi.com
T: +385989035562
Skype: danijel.schiavuzzi



--
Danijel Schiavuzzi
Skype: danijels7