cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4417) invalid counter shard detected
Date Tue, 11 Sep 2012 14:37:07 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453058#comment-13453058 ]

Sylvain Lebresne commented on CASSANDRA-4417:
---------------------------------------------

That's a very good point. Counters do rely on the fact that nodes do not lose the increments
they are "leader" for (or that they don't reuse the same nodeId if they do), but unless the
commit log uses batch mode, such increments can be lost on an unclean shutdown. And that will
lead to exactly the exception seen here, so I'd say there's a very good chance this is the problem.
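
To make the failure mode concrete, here is a toy Python model (not Cassandra code). It assumes
a shard is a (nodeId, clock, count) triple, as in the error message quoted in the issue, and
that each increment bumps the leader's clock by one. The `Leader` class and its `sync` and
`crash_and_restart` methods are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    node_id: str
    clock: int
    count: int


class Leader:
    """Toy leader node: 'durable' is the state covered by the last commit log sync."""

    def __init__(self, node_id: str) -> None:
        self.memory = Shard(node_id, 0, 0)
        self.durable = self.memory

    def increment(self, delta: int) -> Shard:
        s = self.memory
        self.memory = Shard(s.node_id, s.clock + 1, s.count + delta)
        return self.memory  # this is what would be sent to the other replicas

    def sync(self) -> None:
        # Stand-in for a periodic commit log sync.
        self.durable = self.memory

    def crash_and_restart(self) -> None:
        # Unclean shutdown: anything after the last sync is gone,
        # but the node comes back with the same nodeId.
        self.memory = self.durable


leader = Leader("A")
leader.increment(5)
leader.sync()
replicated = leader.increment(7)    # reaches replicas, never synced locally
leader.crash_and_restart()
after_restart = leader.increment(1)  # reuses the clock value of the lost increment

print(replicated)      # Shard(node_id='A', clock=2, count=12)
print(after_restart)   # Shard(node_id='A', clock=2, count=6)
```

The two printed shards have the same nodeId and clock but different counts, which is the
situation the error message describes.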

I'll note that if that is indeed the problem, it's very possible that the error is logged
only much later (well after the "unclean" shutdown) and on some node other than the one that
died. So not being able to correlate the error with an unclean shutdown doesn't really indicate
that it's unrelated.

The consequence of this happening is that the increments that were only in the un-synced part
of the commit log are lost. Meaning that with the default configuration, one could lose up to 10
seconds of increments (those for which the dying node is leader). However, I think it is also
possible for reads to return results that miss slightly more than that, though that last part
should fix itself if the counter is incremented again.

As for the error message logged, it's possible that lots of them are logged even though only
a small number of counters are affected, since it's printed during column reconciliation and
thus could be logged many times for the same counter.
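
As a rough illustration of why the message can repeat, here is a simplified Python sketch of
reconciling two shards that carry the same nodeId. It is a model inferred from the wording of
the error message (highest clock wins; on equal clocks with different counts, log and pick the
highest count), not the actual Cassandra implementation. `Shard` and `reconcile` are
hypothetical names.

```python
import logging
from typing import NamedTuple

log = logging.getLogger("counter-sketch")


class Shard(NamedTuple):
    node_id: str
    clock: int
    count: int


def reconcile(a: Shard, b: Shard) -> Shard:
    """Pick the winning shard between two versions for the same nodeId."""
    if a.node_id != b.node_id:
        raise ValueError("shards belong to different nodeIds")
    if a.clock != b.clock:
        # Normal case: the higher clock is the more recent version.
        return a if a.clock > b.clock else b
    if a.count != b.count:
        # Same clock, different count: should never happen if the leader
        # neither loses increments nor reuses its nodeId after losing them.
        log.error(
            "invalid counter shard detected; %s and %s differ only in count; "
            "will pick highest to self-heal",
            tuple(a), tuple(b),
        )
        return a if a.count > b.count else b
    return a


node = "17bfd850-ac52-11e1-0000-6ecd0b5b61e7"
winner = reconcile(Shard(node, 1, 13), Shard(node, 1, 1))
print(winner.count)  # 13
```

Nothing in this sketch remembers that a conflict was already reported, so every reconciliation
that meets the same pair of shards logs the error again.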

A simple "workaround" is to use the batch commit log mode, but that has a potentially
significant performance impact.

Another solution I've thought of would be to try to detect an unclean shutdown (by marking something
during clean shutdown and checking for that at startup) and, if we detect one, to renew the
nodeId. The problem with that is that this potentially means renewing the nodeId pretty often.
And each time we do that, the internal representation of counters grows, and I'm really afraid
that will be a problem in that case. And while we have a mechanism to shrink counters back
by merging sub-counts when the nodeId is renewed too often, that mechanism assumes that the
node owning the nodeId has the most up-to-date value for this sub-count, which is exactly
the problem here. So overall I don't have any good idea to fix this. Other ideas?
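
For reference, the "mark on clean shutdown, check at startup" idea could look roughly like the
Python sketch below. This is only an illustration of the detection logic under my own
assumptions: the marker file name and the `on_clean_shutdown` and `on_startup` functions are
hypothetical, and a real version would also have to deal with the counter growth described above.

```python
import os
import tempfile
import uuid
from typing import Optional

MARKER = "clean_shutdown.marker"  # hypothetical file name


def on_clean_shutdown(data_dir: str) -> None:
    """Leave a marker behind; only reached when the node stops cleanly."""
    path = os.path.join(data_dir, MARKER)
    with open(path, "w") as f:
        f.write("clean\n")
        f.flush()
        os.fsync(f.fileno())  # the marker itself must survive a later crash


def on_startup(data_dir: str, current_node_id: Optional[str]) -> str:
    """Return the nodeId to use, renewing it if the last shutdown was unclean."""
    path = os.path.join(data_dir, MARKER)
    if current_node_id is not None and os.path.exists(path):
        # Consume the marker so a crash during this run is detected next time.
        os.remove(path)
        return current_node_id
    # First start or unclean shutdown: do not reuse the old nodeId.
    return str(uuid.uuid4())


with tempfile.TemporaryDirectory() as data_dir:
    node_id = on_startup(data_dir, None)
    on_clean_shutdown(data_dir)
    same = on_startup(data_dir, node_id)      # clean restart keeps the nodeId
    renewed = on_startup(data_dir, same)      # no marker: treated as a crash
    print(same == node_id, renewed != node_id)  # True True
```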

                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?
