"I think that's inconsistent with the hypothesis that unclean shutdown is the sole cause of these problems"

I agree; we never shut down any node, nor had any crash, and yet we still see these bugs.

About your side note:

We are aware of it, but we couldn't find any other way to provide real-time analytics. If you know of one, we would be really glad to hear about it.
We need both to serve statistics in real time and to be accurate about prices, and we need coherence between what is shown in our graphs and tables and the invoices we provide to our customers.
What we do is try to avoid timeouts as much as possible (increasing the timeout and keeping the CPU load as low as possible). To keep latency low for the user, we first write the events to a message queue (Kestrel) and then process them with Storm, which writes the events and increments counters in Cassandra.
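To make the write path concrete, here is a minimal single-threaded sketch of the pattern described above, with in-memory stand-ins for Kestrel and Cassandra. All names are hypothetical; the real pipeline is a Storm topology feeding Cassandra counter columns, not a Python loop.

```python
import queue

event_queue = queue.Queue()   # stand-in for the Kestrel message queue
counters = {}                 # stand-in for Cassandra counter columns

def enqueue_event(event):
    """Fast path seen by the user: just enqueue, no Cassandra write yet."""
    event_queue.put(event)

def process_events():
    """Stand-in for the Storm bolt: drain the queue, increment counters."""
    while not event_queue.empty():
        event = event_queue.get()
        key = (event["customer"], event["metric"])
        counters[key] = counters.get(key, 0) + event["amount"]

enqueue_event({"customer": "acme", "metric": "clicks", "amount": 1})
enqueue_event({"customer": "acme", "metric": "clicks", "amount": 2})
process_events()
print(counters[("acme", "clicks")])  # 3
```

The point of the indirection is that the user-facing request only pays the cost of the enqueue; the counter increments happen asynchronously behind it.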

Once again, if you have an idea for a better way of doing this, we are always happy to learn and to improve our architecture and our process.


2012/9/20 Peter Schuller <peter.schuller@infidyne.com>
The significance I think is: If it is indeed the case that the higher
value is always *in fact* correct, I think that's inconsistent with
the hypothesis that unclean shutdown is the sole cause of these
problems - as long as the client is truly submitting non-idempotent
counter increments without a read-before-write.

As a side note: If you're using these counters for stuff like
determining amounts of money to be paid by somebody, consider the
non-idempotence of counter increments. Any write that increments a
counter and fails by e.g. timeout *MAY OR MAY NOT* have been applied
and cannot be safely retried. Cassandra counters are generally not
useful if *strict* correctness is desired, for this reason.
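The failure mode described above can be demonstrated with a toy model: an increment that "times out" after the server has already applied it, combined with a client that blindly retries. All classes and names here are illustrative, not a real Cassandra client API.

```python
import random

class FlakyCounter:
    """Toy server-side counter whose acks sometimes get lost."""
    def __init__(self):
        self.value = 0

    def increment(self, amount, rng):
        self.value += amount          # the write lands on the server...
        if rng.random() < 0.5:
            raise TimeoutError        # ...but the client never hears back

def increment_with_retry(counter, amount, rng, retries=3):
    """Naive client that retries on timeout -- unsafe for counters,
    because a retried increment may already have been applied."""
    for _ in range(retries):
        try:
            counter.increment(amount, rng)
            return
        except TimeoutError:
            continue

rng = random.Random(42)
counter = FlakyCounter()
for _ in range(100):
    increment_with_retry(counter, 1, rng)
print(counter.value)  # over-counts: > 100 once any timed-out write is retried
```

With an idempotent write (e.g. storing individual events keyed by a unique ID and summing later), the same retry loop would be safe; with counter increments it silently inflates the total, which is why strict billing correctness and Cassandra counters don't mix.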

/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)