I ran the same CQL query against my three nodes (after adding the third and repairing each of them):

On the new node:

cqlsh:mykeyspace> select '20121029#myevent' from 'mycf' where key = '887#day';

 20121029#myevent
-------------------
              4983

On the 2 others (old nodes):

cqlsh:mykeyspace> select '20121029#myevent' from 'mycf' where key = '887#day';
 20121029#myevent
-------------------
              4254

And the value read at CL.QUORUM is 4943, which is the correct value.

How is it possible that QUORUM reads 4943 with only one node out of three answering that count?
How could a new node get a value that none of the other existing nodes has?
Is there a way to fix the data (isn't repair supposed to do that)?
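For what it's worth, Cassandra stores each counter internally as a set of per-node shards, and a read at a higher CL merges the shards it collects (keeping the highest count per shard) before summing. A simplified sketch with made-up shard values (not taken from this thread) shows how the merged total can match neither replica's local total:

```python
# Hypothetical counter shards held by two replicas, as {shard_id: count}.
# The real format also carries a clock per shard; this is a simplification.
replica_old = {"node1": 3000, "node2": 1254}   # local total: 4254
replica_new = {"node1": 2900, "node2": 2083}   # local total: 4983

def merge(a, b):
    """Merge two counter replicas by keeping the highest count per shard."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

merged = merge(replica_old, replica_new)
total = sum(merged.values())  # 3000 + 2083 = 5083
print(sum(replica_old.values()), sum(replica_new.values()), total)
```

The merged total (5083 in this made-up example) is different from both 4254 and 4983, which is how a quorum read can return a value no single node reports locally.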

Alain



2012/11/1 Alain RODRIGUEZ <arodrime@gmail.com>
"Can you try it thought, or run a repair ?"

Repairing didn't help

"My first thought is to use QUOURM"

This fixes the problem. However, my data is probably still inconsistent, even if I now always read the same value. The point is that I can't handle a crash with CL.QUORUM; I can't even restart a node...
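For reference, a quorum is floor(RF/2) + 1 replicas, which is exactly why RF=2 cannot tolerate a node being down at QUORUM while RF=3 can. A quick sketch of the arithmetic:

```python
def quorum(replication_factor: int) -> int:
    """Smallest strict majority of replicas: floor(RF/2) + 1."""
    return replication_factor // 2 + 1

# With RF=2, QUORUM needs both replicas: one node down blocks the operation.
print(quorum(2))  # 2 of 2
# With RF=3, QUORUM needs only 2 of 3: one node can be down or restarting.
print(quorum(3))  # 2 of 3
```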

I will add a third server.

"But isn't Cassandra suppose to handle a server crash ? When a server crashes I guess it don't drain before..."
"I was asking to understand how you did the upgrade."

Ok. On my side, I am just concerned about the possibility of using counters with CL.ONE and correctly handling a crash or restart without a drain.

Alain



2012/11/1 aaron morton <aaron@thelastpickle.com>
"What CL are you using ?"

I think this may be what is causing the issue. I'm writing and reading at CL.ONE. I didn't drain before stopping Cassandra, and this may have corrupted the counters that were being written when I stopped the server.
My first thought is to use QUORUM. But with only two nodes it's hard to get strong consistency using QUORUM.
Can you try it though, or run a repair ?

But isn't Cassandra supposed to handle a server crash ? When a server crashes I guess it doesn't drain first...
I was asking to understand how you did the upgrade. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 1/11/2012, at 11:39 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:

"What version of cassandra are you using ?"

1.1.2

"Can you explain this further?"

I had an unexplained amount of reads (up to 1800 r/s and 90 MB/s) on one server; the other was doing about 200 r/s and 5 MB/s max. I fixed it by rebooting the server. This server is dedicated to Cassandra. I can't tell you more about it because I don't understand it... but a simple Cassandra restart wasn't enough.

"Was something writing to the cluster ?"

Yes, we have some activity and perform about 600 w/s.

"Did you drain for the upgrade ?"

We upgraded a long time ago, and to 1.1.2. This warning is about version 1.1.6.

"What changes did you make ?"

In cassandra.yaml I just changed the compaction_throughput_mb_per_sec property to slow down my compactions a bit. I don't think the problem comes from there.
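For context, that property throttles compaction I/O across the node; the value below is only a hypothetical example (the 1.1 default is 16, and 0 disables throttling):

```yaml
# cassandra.yaml -- throttle compaction I/O for the node, in MB/s.
# 8 is an illustrative value, not the one used in this thread.
compaction_throughput_mb_per_sec: 8
```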

"Are you saying that a particular counter column is giving different values for different reads ?"

Yes, this is exactly what I was saying. Sorry if something is wrong with my English; it's not my mother tongue.

"What CL are you using ?"

I think this may be what is causing the issue. I'm writing and reading at CL.ONE. I didn't drain before stopping Cassandra, and this may have corrupted the counters that were being written when I stopped the server.
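That matches how counters behave: unlike a normal column overwrite, a counter increment is not idempotent, so a replayed or retried increment (for example, after an unclean shutdown and commitlog replay) changes the final value. A simplified sketch of the difference, with made-up numbers:

```python
# An overwrite applied twice leaves the same value (idempotent).
value = None
for _ in range(2):
    value = 4254          # replaying the same write is harmless
print(value)              # still 4254

# A counter increment applied twice double-counts (not idempotent).
counter = 4254
for _ in range(2):
    counter += 10         # replaying the same increment skews the count
print(counter)            # 4274 instead of 4264
```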

But isn't Cassandra supposed to handle a server crash ? When a server crashes I guess it doesn't drain first...

Thank you for your time once again, Aaron.

Alain



2012/10/31 aaron morton <aaron@thelastpickle.com>
What version of cassandra are you using ?

I finally restarted Cassandra. It didn't solve the problem, so I stopped Cassandra again on that node and restarted my EC2 server. This solved the issue (1800 r/s down to 100 r/s).
Can you explain this further?
Was something writing to the cluster ?

Today I changed my cassandra.yaml and restarted this same server to apply my conf.
What changes did you make ?

I just noticed that my homepage (which uses a Cassandra counter and refreshes every second) shows me 4 different values: 2 of them repeatedly (5000 and 4000) and the 2 others rarely (5500 and 3800).
Are you saying that a particular counter column is giving different values for different reads ? 
What CL are you using ?

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 31/10/2012, at 3:39 AM, Jason Wee <peichieh@gmail.com> wrote:

maybe enable the debug in log4j-server.properties and going through the log to see what actually happen?

On Tue, Oct 30, 2012 at 7:31 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
Hi, 

I have an issue with counters. Yesterday I had an inexplicable amount of reads/sec on one server. I finally restarted Cassandra. It didn't solve the problem, so I stopped Cassandra again on that node and restarted my EC2 server. This solved the issue (1800 r/s down to 100 r/s).

Today I changed my cassandra.yaml and restarted this same server to apply my conf.

I just noticed that my homepage (which uses a Cassandra counter and refreshes every second) shows me 4 different values: 2 of them repeatedly (5000 and 4000) and the 2 others rarely (5500 and 3800).

Only the counters written today and yesterday are affected.

I performed a repair without success. This data is the heart of our business, so if anyone has any clue about it, I would be really grateful...

The sooner the better; I am in production with these random counters.

Alain

INFO:

My environment is 2 nodes (EC2 Large), RF 2, CL.ONE (reads & writes), RandomPartitioner.

xxx.xxx.xxx.241    eu-west     1b          Up     Normal  151.95 GB       50.00%              0
xxx.xxx.xxx.109    eu-west     1b          Up     Normal  117.71 GB       50.00%              85070591730234615865843651857942052864
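As a rule of thumb for the setup above: reads are only guaranteed to see the latest acknowledged write when the read and write replica counts overlap, i.e. R + W > RF. A small check using this cluster's numbers:

```python
def is_strongly_consistent(write_replicas: int, read_replicas: int, rf: int) -> bool:
    """Read and write replica sets are guaranteed to overlap iff R + W > RF."""
    return read_replicas + write_replicas > rf

# This cluster: RF=2, writing and reading at CL.ONE -> stale reads are possible.
print(is_strongly_consistent(1, 1, 2))  # False
# Same cluster at CL.QUORUM (2 of 2) for both -> replica sets always overlap.
print(is_strongly_consistent(2, 2, 2))  # True
```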