I agree, but we have never shut down a node or had a crash, and yet we still see these bugs.
About your side note:
We know about it, but we couldn't find any other way to provide real-time analytics. If you know of one, we would be really glad to hear about it.
We need to serve statistics in real time and be accurate about prices, and what is shown in our graphs and tables has to be consistent with the invoices we send to our customers.
What we do is try to avoid timeouts as much as possible, by increasing the timeout duration and keeping the CPU load as low as we can. To keep latency low for the user, we first write the events to a message queue (Kestrel) and then process them with Storm, which writes the events and increments counters in Cassandra.
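To make the flow concrete, here is a minimal sketch of that pipeline in Python. It is only an illustration under loud assumptions: the queue, the event list, and the counters are in-memory stand-ins for Kestrel, the raw-event table, and the Cassandra counters, and every name in it (`record_event`, `process_pending`, `price_cents`, the customer names) is hypothetical rather than taken from our code.

```python
import queue
from collections import Counter

# Hypothetical in-memory stand-ins; the real system uses Kestrel,
# a Storm topology, and Cassandra instead of these objects.
event_queue = queue.Queue()   # plays the role of the Kestrel queue
stored_events = []            # plays the role of the raw-event table
counters = Counter()          # plays the role of the Cassandra counters


def record_event(customer, price_cents):
    """Front end: enqueue the event and return at once to keep latency low."""
    event_queue.put({"customer": customer, "price_cents": price_cents})


def process_pending():
    """Worker (the Storm role): store each event, then increment its counter."""
    processed = 0
    while True:
        try:
            event = event_queue.get_nowait()
        except queue.Empty:
            return processed
        stored_events.append(event)                          # write the event
        counters[event["customer"]] += event["price_cents"]  # increment counter
        processed += 1


record_event("acme", 250)
record_event("acme", 100)
record_event("globex", 75)
process_pending()
print(counters["acme"])  # 350
```

In the real system the increment is the fragile step: a counter increment that times out cannot be safely retried, because it may already have been applied. That is why we work so hard to avoid timeouts.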
Once again, if you have any idea of a better way to do this, we are always happy to learn and to improve our architecture and our process.