cassandra-user mailing list archives

From java8964 java8964 <>
Subject RE: Questions related to the data in SSTable files
Date Wed, 23 Oct 2013 12:23:32 GMT
We have enabled a major repair on every node every 7 days.
I think you mean there are 2 cases of "failed" writes.
One is a failed replication of a write. Duplication generated by this kind of "failed" write
should be very small in my case, because I only parse the data from 12 nodes, which should
NOT include any replica nodes.
If one node persists a write, plus a "hint" for the failed replica write, the write will
still be stored as one write in its SSTable files, right? Why would it need to store 2 copies
in SSTable files?
The other case is the one you describe, where the client retries the write when a timeout exception happens.
That is a reasonable explanation for the duplication.
Here are the duplication counts found in our SSTable files. You can see that a lot of data is duplicated
2 times, but some has a much higher count. The max duplication count is 27; can one client
really retry 27 times?
duplication_count  duplication_occurrence
2                  123615348
3                  6446783
4                  21102
5                  1054
6                  2496
7                  47
8                  726
9                  52
10                 12
11                 3
12                 7
13                 9
14                 7
15                 3
16                 2
17                 2
18                 1
19                 5
20                 5
22                 1
23                 3
25                 2
27                 99
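
For reference, a histogram like the one above can be produced roughly as follows. This is only a minimal sketch, not the actual job that produced these numbers: it assumes each cell parsed from the SSTable files is reduced to a hypothetical (row_key, column_name, timestamp) tuple, and the function name and sample data are made up for illustration.

```python
from collections import Counter


def duplication_histogram(cells):
    """Map duplication_count -> number of distinct cells seen that many times."""
    per_cell = Counter(cells)  # cell -> number of physical copies
    histogram = Counter(n for n in per_cell.values() if n >= 2)
    return dict(sorted(histogram.items()))


# Hypothetical parsed cells: (row_key, column_name, timestamp)
sample = [
    ("k1", "c1", 100), ("k1", "c1", 100),
    ("k2", "c1", 200), ("k2", "c1", 200), ("k2", "c1", 200),
    ("k3", "c1", 300),
    ("k4", "c1", 400), ("k4", "c1", 400),
]

print(duplication_histogram(sample))  # {2: 2, 3: 1}
```

Cells seen only once are left out, which matches the table starting at a count of 2.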
Another question: do you have any guess as to what could cause case 2 in my original email?
Date: Tue, 22 Oct 2013 17:52:24 -0700
Subject: Re: Questions related to the data in SSTable files

On Tue, Oct 22, 2013 at 5:17 PM, java8964 java8964 <> wrote:

Is there any way I can verify how often the system is being "repaired"? I can ask the other group that maintains
the Cassandra cluster. But do you mean that even the failed writes will be stored in the SSTable

"repair" sessions are logged in system.log, and the "best practice" is to run a repair once
every gc_grace_seconds, which defaults to 10 days.

A "failed" write means only that it "failed" to meet its ConsistencyLevel in the request_timeout.
It does not mean that it failed to write everywhere it tried to write. There is no rollback,
so in practice with RF>1 it is likely that a "failed" write succeeded at least somewhere.
But if any failure is noted, Cassandra will generate a hint for hinted handoff and attempt
to redeliver the "failed" write. Also, many/most client applications will respond to a TimedOutException
by attempting to re-write the "failed" write, using the same client timestamp.
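
To illustrate why a retry with the same client timestamp is harmless logically even though it can leave extra physical copies in SSTable files, here is a minimal sketch. It assumes last-write-wins reconciliation by timestamp; the function name and sample values are hypothetical, and tie-breaking between different values with equal timestamps is deliberately not modeled.

```python
def reconcile(copies):
    """Pick the winning (timestamp, value) copy of one cell: highest timestamp."""
    return max(copies, key=lambda copy: copy[0])


# Hypothetical copies of one cell found in different SSTable files.
original = (1382487812000000, "v1")
retry = (1382487812000000, "v1")       # client retried with the same timestamp
later_update = (1382487999000000, "v2")

print(reconcile([original, retry]))                # (1382487812000000, 'v1')
print(reconcile([original, retry, later_update]))  # (1382487999000000, 'v2')
```

The retried copy is identical to the original, so a reader sees one value; the duplicate only shows up when the files are parsed directly.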

Repair has a fixed granularity, so the larger the size of your dataset the more "over-repair"
any given "repair" will cause.
Duplicates occur as a natural consequence of this: if you have 1 row which differs in the
Merkle tree chunk and the Merkle tree chunk is, for example, 1000 rows, you will "repair"
one row and "duplicate" the other 999.
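
The arithmetic above can be sketched like this. It is a simplified illustration, not how Cassandra implements repair: it assumes both replicas hold the same number of rows in the same order, chunks them by row count rather than by token range, and uses hypothetical function names.

```python
import hashlib


def leaf_hashes(rows, rows_per_leaf):
    """Hash consecutive chunks of rows, one digest per chunk (a leaf)."""
    hashes = []
    for start in range(0, len(rows), rows_per_leaf):
        chunk = rows[start:start + rows_per_leaf]
        hashes.append(hashlib.sha256(repr(chunk).encode()).hexdigest())
    return hashes


def rows_streamed(replica_a, replica_b, rows_per_leaf):
    """Count rows re-sent when every mismatching leaf is streamed whole."""
    a = leaf_hashes(replica_a, rows_per_leaf)
    b = leaf_hashes(replica_b, rows_per_leaf)
    mismatched = sum(1 for x, y in zip(a, b) if x != y)
    return mismatched * rows_per_leaf


replica_a = [(i, "v") for i in range(5000)]
replica_b = list(replica_a)
replica_b[1234] = (1234, "stale")  # a single differing row

streamed = rows_streamed(replica_a, replica_b, rows_per_leaf=1000)
print(streamed, streamed - 1)  # 1000 999
```

One differing row makes its whole 1000-row chunk mismatch, so 1000 rows are streamed and 999 of them are copies of data the receiving node already had.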