cassandra-user mailing list archives

From Peter Schuller
Subject Re: Invalid Counter Shard errors?
Date Thu, 20 Sep 2012 06:19:47 GMT
> I don't understand what the three in parentheses values are exactly. I guess
> the last number is the count and the middle one is the number of increments,
> is that true ? What is the first string (identical in all the errors) ?

It's (UUID, clock, increment). Very briefly, counter columns in
Cassandra are made up of multiple "shards". In the write path, a
particular counter increment is executed by one "leader", which is one
of the replicas of the counter. The leader will increment its own
value, read its own full value (this is why "Replicate On Write" has
to do reads in the write path for counters) and replicate it to the
other replicas.

UUID "roughly" corresponds to a node in the cluster (UUIDs are
sometimes refreshed, so it's not a strict correlation). The clock is
supposed to be monotonically increasing for a given UUID.
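
As a rough illustration of the shard structure described above, here is
a minimal sketch in Python. It is a simplified model, not Cassandra's
actual code: the names `Shard`, `leader_increment` and `counter_total`
are made up for this example, and it assumes the counter's total is the
sum of the per-UUID shard values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Shard:
    node_id: str  # stands in for the UUID (hypothetical plain string here)
    clock: int    # supposed to increase monotonically for a given node_id
    value: int    # this shard's contribution to the counter


def leader_increment(shards: Dict[str, Shard], node_id: str, delta: int) -> Shard:
    """The leader bumps its own shard: clock goes up by one, value by delta."""
    old = shards.get(node_id, Shard(node_id, 0, 0))
    new = Shard(node_id, old.clock + 1, old.value + delta)
    shards[node_id] = new
    return new  # in this model, this is what would be sent to the other replicas


def counter_total(shards: Dict[str, Shard]) -> int:
    """Assumption of this sketch: the counter value is the sum over all shards."""
    return sum(s.value for s in shards.values())


shards: Dict[str, Shard] = {}
leader_increment(shards, "A", 5)
leader_increment(shards, "B", 3)
leader_increment(shards, "A", 2)
print(counter_total(shards))
```

Each leader only ever touches its own shard, which is why the clock can
be tracked per UUID.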

> How can the last number (assuming it's the count) be negative knowing that I
> only sum positive numbers ?

I don't see a negative number in your paste?

> An other point is that the highest value seems to be *always* the good one
> (assuming this time that the middle number is the number of increments).

DISCLAIMER: This is me responding off the cuff without digging into it further.

Depends on the source of the problem. If the problem, as theorized in
the ticket, is caused by non-clean shutdown of nodes the expected
result *should* be that we effectively "lose" counter increments.
Given a particular leader among the replicas, suppose you increment
counter C by N1, followed by an unclean shutdown with the value never
having been written to the commit log. On the next increment of C by
N2, a counter shard would be generated whose value is base+N2 instead
of base+N1+N2 (assuming the memtable wasn't flushed and no other
writes to the same counter column happened).
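
To make the scenario concrete, here is a small Python sketch of it. The
numbers are hypothetical, a shard is reduced to a (clock, value) pair,
and the sketch assumes the N1 increment reached other replicas before
the crash but was never made durable on the leader.

```python
def increment(shard, delta):
    """Return the leader's next shard state: clock + 1, value + delta."""
    clock, value = shard
    return (clock + 1, value + delta)


# Hypothetical durable state on the leader: clock 7, value 100 ("base").
base = (7, 100)
n1, n2 = 5, 3

# Increment by N1: sent to other replicas, but lost locally in the crash.
sent_before_crash = increment(base, n1)

# After the unclean shutdown the leader restarts from `base`, so the next
# increment reuses the same clock with a different value.
sent_after_restart = increment(base, n2)

print(sent_before_crash, sent_after_restart)
```

The two shards carry the same clock but different values, which is the
inconsistency the error message reports.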

When this gets replicated to other nodes, they would see a value based
on N1 and a value based on N2, both with the same clock. They would
choose the higher one. In either case, as far as I can tell (off the
top of my head), *some* counter increment is lost. The only way I can
see (again off the top of my head) the resulting value being correct
is if the later increment (N2 in this case) is somehow including N1 as
well (e.g., because it was generated by first reading the current
counter value).
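
The "choose the higher one" behavior and its effect can be sketched in
Python as follows. This is an illustration under assumptions, not the
actual reconciliation code: `merge` is a made-up helper, a shard is a
(clock, value) pair, and the numbers are hypothetical.

```python
def merge(a, b):
    """Reconcile two versions of the same UUID's shard, each (clock, value)."""
    if a[0] != b[0]:
        # Normal case: the higher clock wins.
        return a if a[0] > b[0] else b
    # Same clock, different values: keep the higher value.
    return a if a[1] >= b[1] else b


base, n1, n2 = 100, 5, 3
expected = base + n1 + n2

# Case 1: the second shard was built from `base` only, so N1 is missing in it.
merged = merge((8, base + n1), (8, base + n2))
lost = expected - merged[1]

# Case 2: the later increment somehow included N1 (e.g. it read the current
# counter value first), so the higher value happens to be the correct one.
merged_ok = merge((8, base + n1), (8, base + n1 + n2))
lost_ok = expected - merged_ok[1]

print(merged, lost, merged_ok, lost_ok)
```

In the first case the smaller of the two increments disappears; only
the second case ends up with the correct total.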

/ Peter Schuller (@scode,
