cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3070) counter repair
Date Thu, 01 Sep 2011 14:41:10 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095321#comment-13095321 ]

Sylvain Lebresne commented on CASSANDRA-3070:
---------------------------------------------

The last txt file you attached is just a copy of your comment.

Now, what you describe is roughly what I got from the previous log. But the thing is, there
is nothing wrong with the way the counter values are resolved. There seems to be a value in
there that shouldn't exist, though. So the truth is that without a way to reproduce it, it will
be harder to find what could be wrong in there. Do attach the newly generated log though; there
could be something slightly different that'll help. I'll continue to look the code in the eyes
and see if I find something.

A few questions, though, that could help narrow it down:
* You said that 2 of the servers return a lower number. Can you be sure, however, that the "right"
value should be the greater one? For instance, do you only do increments > 1? Or better,
do you have another source that would allow you to tell what the right value is?
* Does that happen with many counters? Your initial description does suggest it happens
to more than one, but do you have an idea of how frequent it is? And if you have multiple
bad counters, are the nodes that are out of sync always the same nodes?
* You marked that 0.8.4 is affected, but was the cluster started on 0.8.4? Or more precisely,
do you have an example of a counter that is problematic and that you are sure was created
*after* your upgrade to 0.8.4? (I want to be sure I can definitively rule out some unfortunate
consequence of CASSANDRA-2968, even though I doubt this could be it.)
* Just to be sure, you did not remove an sstable by mistake or something like that? Or truncate
the counter column family?
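
To help answer the second question, here is a minimal sketch of how per-replica readings could be tallied to see whether the same nodes are always the ones that disagree. It is only an illustration: the function name, the node names and the sample values are hypothetical, the readings are assumed to have been collected by hand (one read per replica), and the most common value is used purely as a reference point, not as the "right" value.

```python
from collections import Counter


def out_of_sync_nodes(readings):
    """Count, per node, how many counters disagree with the most common value.

    readings: {counter_key: {node_name: value}} -- hypothetical input,
    gathered manually by reading each counter from each replica.
    """
    tally = Counter()
    for key, per_node in readings.items():
        # The most common value is only a reference point; it is not
        # necessarily the correct counter value.
        reference_value, _ = Counter(per_node.values()).most_common(1)[0]
        for node, value in per_node.items():
            if value != reference_value:
                tally[node] += 1
    return tally


# Made-up sample data, for illustration only.
readings = {
    "counter_a": {"node1": 10, "node2": 10, "node3": 12},
    "counter_b": {"node1": 7, "node2": 7, "node3": 9},
    "counter_c": {"node1": 3, "node2": 3, "node3": 3},
}
print(out_of_sync_nodes(readings))
```

If one node dominates the tally, that would suggest the divergence is tied to specific nodes rather than spread evenly across the cluster.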

Last thing: if there is indeed more than one problematic counter, attaching output
logs for at least two of them would be helpful. There could be some similarity that helps
find what's wrong.

> counter repair
> --------------
>
>                 Key: CASSANDRA-3070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3070
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.4
>            Reporter: ivan
>            Assignee: Sylvain Lebresne
>         Attachments: counter_local_quroum_maybeschedulerepairs.txt, counter_local_quroum_maybeschedulerepairs_2.txt
>
>
> Hi!
> We have some counters out of sync but repair doesn't sync values.
> We tried nodetool repair.
> We use LOCAL_QUORUM for read. A repair row mutation is sent to other nodes while reading
> a bad row, but the counters weren't repaired by the mutation.
> Output from two nodes was uploaded. (Some new debug messages were added.)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
