cassandra-user mailing list archives

From Edward Capriolo <>
Subject Re: repair question
Date Tue, 24 May 2011 14:05:08 GMT
On Tue, May 24, 2011 at 9:41 AM, Sylvain Lebresne <> wrote:

> On Tue, May 24, 2011 at 12:40 AM, Daniel Doubleday
> <> wrote:
> > We are performing the repair on one node only. Other nodes receive
> > reasonable amounts of data (~500MB). It's only the repairing node itself
> > which 'explodes'.
>
> That, for instance, is a bit weird. That the node on which the repair
> is performed gets more data is expected, since it is repaired against all
> its neighbors, while the neighbors themselves get repaired only
> against that given node. But when differences between two nodes A and B
> are computed, the ranges to repair are streamed both from A to B and
> from B to A. Unless A and B are widely out of sync (like A has no data
> and B has tons of it), around the same amount of data should transit in
> both directions. So with RF=3, the node on which repair was started should
> get around 4 times (up to 6 times if you have a weird topology) as much
> data as any neighboring node, but that's it. Whereas, if I'm correct, you
> are reporting that the neighboring node gets ~500MB and the
> "coordinator" gets > 700GB ?!
> Honestly I'm not sure an imprecision of the merkle tree could account
> for that behavior.
> Anyway, Daniel, would you be able to share the logs of the nodes (at
> least the node on which repair is started) ? I'm not sure how much
> that could help but that cannot hurt.
> --
> Sylvain
> >
> > I must admit that I'm a noob when it comes to AES/repair. It's just
> > strange that a cluster that is up and running with no problems is doing that.
> > But I understand that it's not supposed to do what it's doing. I just hope
> > that I find out why soon enough.
> >
> >
> >
> > On 23.05.2011, at 21:21, Peter Schuller <> wrote:
> >
> >>> I'm a bit lost: I tried a repair yesterday with only one CF and that
> >>> didn't really work the way I expected, but I thought that would be a bug
> >>> which only affects that special case.
> >>>
> >>> So I tried again for all CFs.
> >>>
> >>> I started with a nicely compacted machine with around 320GB of load.
> >>> Total disk space on this node was 1.1TB.
> >>
> >> Did you do repairs simultaneously on all nodes?
> >>
> >> I have seen very significant disk space increases under some
> >> circumstances. While I haven't filed a ticket about it because there
> >> was never time to confirm, I believe two things were at play:
> >>
> >> (1) nodes were sufficiently out of sync in a sufficiently spread out
> >> fashion that the granularity of the merkle tree (IIRC, and if I read
> >> correctly, it divides the ring into up to 2^15 segments but no more)
> >> became ineffective so that repair effectively had to transfer all the
> >> data. At first I thought there was an outright bug, but after looking
> >> at the code I suspected it was just the merkle tree granularity.
> >>
> >> (2) I suspected at the time that a contributing factor was also that
> >> as one repair might cause a node to significantly increase its live
> >> sstables temporarily until they are compacted, another repair on
> >> another node may start and begin its validating compaction and streaming
> >> of that data - leading to disk space bloat essentially being
> >> "contagious"; the third node, streaming from the node that was
> >> temporarily bloated, will receive even more data from that node than
> >> it normally would.
> >>
> >> We're making sure to only run one repair at a time between any hosts
> >> that are neighbors of each other (meaning that at RF=3, that's 1
> >> concurrent repair per 6 nodes in the cluster).
> >>
> >> I'd be interested in hearing anyone confirm or deny whether my
> >> understanding of (1) in particular is correct. To connect it to
> >> reality: a 20 GB CF divided into 2^15 segments implies each segment is
> >> more than 600 kbyte in size. For CFs with tens or hundreds of millions of
> >> small rows and a fairly random (with respect to partitioner) update
> >> pattern, it's not very difficult to end up in a situation where most
> >> 600 kbyte chunks contain out-of-sync data. Particularly in a
> >> situation with lots of dropped messages.
> >>
> >> I'm getting the 2^15 from AntiEntropyService.Validator.Validator()
> >> which passes a maxsize of 2^15 to the MerkleTree constructor.
> >>
> >> --
> >> / Peter Schuller
> >

I never run repair for just this reason. It is very intensive and it
produces a lot of data. I am still on 0.6.X, so this would be better for me
when I upgrade. If I had to take a wild stab at it, I would guess that "guys
like us" with 300 GB of data and possibly tiny rows run a greater chance of
something not being in sync than those with 20 GB a node, for example.

If you are doing a high volume of inserts and have disabled HH, or even set
it to only store hints for an hour, and you had a three hour outage, some
nodes are going to be out of sync.
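
To put rough numbers on the merkle tree granularity point quoted above, here
is a back-of-the-envelope sketch in Python. It is not Cassandra code. It
assumes the tree is capped at 2^15 segments (the figure Peter quotes), that
data is spread evenly over the segments, and that out-of-sync rows land in
segments uniformly at random. The function names are made up for this
illustration.

```python
# Back-of-the-envelope model only; nothing here is taken from Cassandra itself.
MAX_SEGMENTS = 2 ** 15  # segment cap quoted in the thread (assumed to apply per CF)
GIB = 1024 ** 3


def segment_size_bytes(data_bytes, segments=MAX_SEGMENTS):
    """Average amount of data covered by one merkle tree segment."""
    return data_bytes / segments


def expected_dirty_fraction(out_of_sync_rows, segments=MAX_SEGMENTS):
    """Expected fraction of segments containing at least one out-of-sync row.

    Assumption: each out-of-sync row falls into a segment uniformly at random,
    so a given segment stays clean with probability (1 - 1/segments)^rows.
    """
    return 1.0 - (1.0 - 1.0 / segments) ** out_of_sync_rows


def expected_streamed_bytes(data_bytes, out_of_sync_rows, segments=MAX_SEGMENTS):
    """Rough estimate of data streamed if every dirty segment is sent whole."""
    return data_bytes * expected_dirty_fraction(out_of_sync_rows, segments)


if __name__ == "__main__":
    rows = 100_000  # hypothetical number of rows that differ between replicas
    for size_gib in (20, 300):
        data = size_gib * GIB
        print(
            f"{size_gib} GiB: segment ~{segment_size_bytes(data) / 1024:.0f} KiB, "
            f"dirty fraction {expected_dirty_fraction(rows):.2%}, "
            f"streamed ~{expected_streamed_bytes(data, rows) / GIB:.1f} GiB"
        )
```

Under these assumptions, about 100,000 scattered out-of-sync rows are enough
to mark roughly 95% of the segments as different, so the repair ends up
streaming nearly the whole CF no matter how small the rows are. The dirty
fraction does not depend on the data size, but each dirty segment on a 300 GB
node carries 15 times more data than on a 20 GB node.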
