cassandra-commits mailing list archives

From "Boris Yen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3006) Enormous counter
Date Wed, 10 Aug 2011 03:33:27 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082122#comment-13082122 ]

Boris Yen commented on CASSANDRA-3006:
--------------------------------------

To make this issue easier to reproduce, I document how I recreate it, step by step.

1. clean anything inside /var/lib/cassandra on node 172.17.19.151.

2. start cassandra on node 172.17.19.151.

3. clean anything inside /var/lib/cassandra on node 172.17.19.152.

4. modify the cassandra.yaml of 172.17.19.152 and add 172.17.19.151 as a seed.

5. start cassandra on node 172.17.19.152. I could see the two nodes had formed a cluster, which I also double-checked with nodetool.

6. on node 172.17.19.151, use cassandra-cli to connect to 172.17.19.151/9160 and execute the following commands:

create keyspace test
with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options = [{datacenter1:2}];

create column family testCounter
    with column_type = Super
    and default_validation_class = CounterColumnType
    and replicate_on_write = true
    and comparator = BytesType
    and subcomparator = BytesType
    and comment = 'APP status information.';

7. use the test program to increment the counter 1000 times; between increments the program pauses 50 milliseconds.
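The test program itself is not attached to the issue; the following is a minimal sketch of its increment loop, with a hypothetical in-memory stub standing in for the real hector/Thrift client (all class and method names here are made up for illustration):

```python
import time

class StubCounterClient:
    """Hypothetical stand-in for the hector-based client; a real run
    would send counter adds to the testCounter column family over Thrift."""
    def __init__(self):
        self.value = 0

    def add(self, key, super_column, column, delta=1):
        # In the stub, every add simply succeeds locally.
        self.value += delta

def run_increments(client, n=1000, pause=0.05):
    """Increment the counter n times, pausing between adds (step 7)."""
    successes = 0
    for _ in range(n):
        client.add("key", "sc", "column")
        successes += 1
        time.sleep(pause)
    return successes

if __name__ == "__main__":
    client = StubCounterClient()
    print("success counter:", run_increments(client, n=10, pause=0.0))
```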

8. in the middle of the adding process, shut down cassandra on node 172.17.19.152 (say, when the count is 200). Because the test program drops the consistency level to One when it encounters an exception (a timeout exception, to be exact), the subsequent increments still succeed.
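This downgrade-and-retry behaviour is also what produces the 1001 seen in step 10: the timed-out add is in fact applied by the surviving replica, and the client's retry applies it a second time. A self-contained sketch of that logic, with a toy counter replica in place of hector (all names are hypothetical, and the retry is only roughly hector-like):

```python
QUORUM, ONE = "QUORUM", "ONE"

class FlakyCounter:
    """Toy replica: the add at a chosen step is applied, but its
    acknowledgement is 'lost', so the client sees a timeout."""
    def __init__(self, timeout_at):
        self.value = 0
        self.timeout_at = timeout_at
        self.calls = 0

    def add(self, consistency):
        self.calls += 1
        self.value += 1            # the write lands on the replica...
        if self.calls == self.timeout_at and consistency == QUORUM:
            raise TimeoutError     # ...but the QUORUM ack times out

def run(counter, n=1000):
    """Mimic the test program: start at QUORUM; on timeout, downgrade
    to ONE (step 8) and retry the failed add (hector-style retry)."""
    level = QUORUM
    successes = 0
    for _ in range(n):
        try:
            counter.add(level)
            successes += 1
        except TimeoutError:
            level = ONE            # downgrade after the exception
            counter.add(level)     # retry; the original attempt is not
                                   # counted as a success
    return successes

if __name__ == "__main__":
    c = FlakyCounter(timeout_at=200)
    print(run(c), c.value)   # 999 successes, counter value 1001
```

So "success counter: 999" and a stored value of 1001 are both expected after one timeout: 999 counted adds, plus the timed-out add and its retry.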

9. wait for the overall adding process to complete. I saw "success counter: 999" because of the one exception.

10. use cassandra-cli to connect to 172.17.19.151 and 172.17.19.152 and check the counter value; it is 1001 on both nodes. It shows 1001 because hector retries when it encounters the timeout exception.

11. shut down cassandra on 172.17.19.151 and wait a few seconds, until "InetAddress /172.17.19.151 is now dead" appears on node 172.17.19.152.

12. after seeing "InetAddress /172.17.19.151 is now dead", restart cassandra on node 172.17.19.151.

13. check the counter again with cassandra-cli on both nodes. This time the counter is no longer 1001; it is some other, wildly wrong number.

I hope someone else can recreate it with these steps.

> Enormous counter 
> -----------------
>
>                 Key: CASSANDRA-3006
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3006
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.8.3
>         Environment: ubuntu 10.04
>            Reporter: Boris Yen
>            Assignee: Sylvain Lebresne
>
> I have two-node cluster with the following keyspace and column family settings.
> Cluster Information:
>    Snitch: org.apache.cassandra.locator.SimpleSnitch
>    Partitioner: org.apache.cassandra.dht.RandomPartitioner
>    Schema versions: 
> 	63fda700-c243-11e0-0000-2d03dcafebdf: [172.17.19.151, 172.17.19.152]
> Keyspace: test:
>   Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
>   Durable Writes: true
>     Options: [datacenter1:2]
>   Column Families:
>     ColumnFamily: testCounter (Super)
>     "APP status information."
>       Key Validation Class: org.apache.cassandra.db.marshal.BytesType
>       Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType
>       Columns sorted by: org.apache.cassandra.db.marshal.BytesType/org.apache.cassandra.db.marshal.BytesType
>       Row cache size / save period in seconds: 0.0/0
>       Key cache size / save period in seconds: 200000.0/14400
>       Memtable thresholds: 1.1578125/1440/247 (millions of ops/MB/minutes)
>       GC grace seconds: 864000
>       Compaction min/max thresholds: 4/32
>       Read repair chance: 1.0
>       Replicate on write: true
>       Built indexes: []
> Then, I use a test program based on hector to add a counter column (testCounter[sc][column])
> 1000 times. In the middle of the adding process, I intentionally shut down node 172.17.19.152.
> In addition, the test program is smart enough to switch the consistency level from
> Quorum to One, so that the subsequent adds do not fail.
> After all the adding actions are done, I start cassandra on 172.17.19.152 and use
> cassandra-cli to check whether the counter is correct on both nodes. I got 1001,
> which is reasonable because hector retries once. However, I then shut down 172.17.19.151,
> and after 172.17.19.152 is aware that 172.17.19.151 is down, I start cassandra on
> 172.17.19.151 again. When I check the counter again, this time I get 481387, which
> is wildly wrong.
> I used 0.8.3 to reproduce this bug, but I think it also happens on 0.8.2 and earlier.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
