incubator-cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@datastax.com>
Subject Re: repair question
Date Tue, 24 May 2011 13:32:31 GMT
On Mon, May 23, 2011 at 9:21 PM, Peter Schuller
<peter.schuller@infidyne.com> wrote:
>> I'm a bit lost: I tried a repair yesterday with only one CF and that didn't really
>> work the way I expected, but I thought that was a bug which only affects that special
>> case.
>>
>> So I tried again for all CFs.
>>
>> I started with a nicely compacted machine with around 320GB of load. Total disk space
>> on this node was 1.1TB.
>
> Did you do repairs simultaneously on all nodes?
>
> I have seen very significant disk space increases under some
> circumstances. While I haven't filed a ticket about it because there
> was never time to confirm, I believe two things were at play:
>
> (1) nodes were sufficiently out of sync, in a sufficiently spread-out
> fashion, that the granularity of the merkle tree (IIRC, and if I read
> correctly, it divides the ring into up to 2^15 segments but no more)
> became ineffective, so that repair effectively had to transfer all the
> data. At first I thought there was an outright bug, but after looking
> at the code I suspected it was just the merkle tree granularity.

It's slightly more complicated than that, but that's the main idea, yes.
The key samples are used to help split the merkle tree "where the data is",
so the segments are not all of equal size. Also, it's not the whole ring that
is divided into 2^15 segments, but only the ranges the node is responsible
for (which is a little better). In 0.8 it's even a little better, since we
compute one tree per range, so each tree divides only a single range (while
in 0.7 it's all the ranges the node is responsible for, taken together). Not
sure how clear I am, but anyway it is true that at some point the granularity
of the tree may be "not enough".
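To make the granularity limit concrete, here is a back-of-envelope sketch
(assuming the 2^15 segment cap quoted above; the function name is just
illustrative, not anything from the Cassandra source):

```python
# Back-of-envelope sketch of Merkle tree granularity. Assumption: a tree is
# capped at 2^15 leaf segments, per the figure quoted in this thread.
MAX_SEGMENTS = 2 ** 15

def min_segment_size(data_bytes, max_segments=MAX_SEGMENTS):
    """Smallest amount of data one fully-split Merkle segment can cover.
    A single out-of-sync row anywhere in a segment forces re-streaming
    at least this much data."""
    return data_bytes / max_segments

# Peter's example downthread: a 20 GB column family.
cf_bytes = 20 * 1024 ** 3
print(min_segment_size(cf_bytes) / 1024)  # 640.0 (KiB per segment)
```

With randomly distributed updates, most of those ~640 KiB segments end up
containing at least one differing row, so nearly the whole CF is streamed.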

And the more spread out the out-of-sync data is, the worse it will be. Though
in general we can expect it not to be too spread out, for the same reason that
caches work: access patterns tend to have locality.

So maybe it is worth improving the granularity of the tree. It is true that
having a fixed granularity, as we do, is probably a bit naive. In any case, it
will still be a trade-off between the precision of the tree and its size,
which impacts both memory use and what we have to transfer (though the
transfer could be done level by level; right now the whole tree is always
transferred, which is not optimal).
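The size side of that trade-off is easy to sketch. Assuming a full binary
tree and a 16-byte hash per node (the hash width is an assumption for
illustration, not taken from the Cassandra source):

```python
# Rough memory/transfer cost of a full binary Merkle tree at a given depth.
# Assumption: 16 bytes of hash per node; real per-node overhead would be higher.
def tree_bytes(depth, hash_bytes=16):
    nodes = 2 ** (depth + 1) - 1  # internal nodes plus leaves
    return nodes * hash_bytes

print(tree_bytes(15))  # 1048560 -> about 1 MiB at depth 15
print(tree_bytes(20))  # 33554416 -> about 32 MiB at depth 20
```

Each extra level of precision doubles the tree, which is why simply raising
the 2^15 cap is not free: the whole tree is held in memory and transferred.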

Now there are probably a number of ideas we could implement to improve that
precision, but they may require non-trivial changes, so I'd really like to be
sure it is worth the effort first. Maybe a first step would be to instrument
the repair process to compute the number of rows (and the data size) that each
merkle tree segment ends up containing.
I've created https://issues.apache.org/jira/browse/CASSANDRA-2698 for that.
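As a rough sketch of what such instrumentation could report (the even split
of the token space and the helper names are assumptions for illustration; the
real code would use the partitioner's tokens and the actual tree splits):

```python
# Sketch of the instrumentation idea in CASSANDRA-2698: while validating,
# count rows and bytes falling into each Merkle segment, so we can see
# whether the fixed granularity is fine enough in practice.
from collections import defaultdict

def segment_stats(rows, num_segments, token_space):
    """rows: iterable of (token, row_size_bytes) pairs.
    Returns {segment_index: [row_count, total_bytes]}, assuming the
    token space is split evenly into num_segments segments."""
    seg_width = token_space / num_segments
    stats = defaultdict(lambda: [0, 0])
    for token, size in rows:
        seg = int(token // seg_width)
        stats[seg][0] += 1
        stats[seg][1] += size
    return dict(stats)

# Toy example: segment width is 1024/4 = 256, so tokens 5 and 7 land in
# segment 0 and token 300 lands in segment 1.
rows = [(5, 100), (7, 200), (300, 50)]
print(segment_stats(rows, num_segments=4, token_space=1024))
# {0: [2, 300], 1: [1, 50]}
```

A histogram like this would show directly whether typical segments hold a
handful of rows or millions of them.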

>
> (2) I suspected at the time that a contributing factor was also that,
> as one repair may cause a node to temporarily have significantly more
> live sstables until they are compacted, another repair on another node
> may start validation compaction and streaming of that data, leading to
> disk space bloat essentially being "contagious": a third node streaming
> from the node that was temporarily bloated will receive even more data
> from that node than it normally would.
>
> We're making sure to only run one repair at a time between any hosts
> that are neighbors of each other (meaning that at RF=3, that's 1
> concurrent repair per 6 nodes in the cluster).
>
> I'd be interested in hearing anyone confirm or deny whether my
> understanding of (1) in particular is correct. To connect it to
> reality: a 20 GB CF divided into 2^15 segments implies each segment is
> > 600 kbyte in size. For CFs with tens or hundreds of millions of
> small rows and a fairly random (with respect to the partitioner) update
> pattern, it's not very difficult to end up in a situation where most
> 600 kbyte chunks contain out-of-sync data, particularly in a situation
> with lots of dropped messages.
>
> I'm getting the 2^15 from AntiEntropyService.Validator.Validator(),
> which passes a maxsize of 2^15 to the MerkleTree constructor.
>
> --
> / Peter Schuller
>
