cassandra-user mailing list archives

From Sylvain Lebresne
Subject Re: what's the difference between repair CF separately and repair the entire node?
Date Wed, 14 Sep 2011 06:47:33 GMT
On Wed, Sep 14, 2011 at 2:38 AM, Yan Chunlu wrote:
> I don't want to repair one CF at a time either.
> The "node repair" ran for a week and was still going. compactionstats and
> netstream show nothing running on any node, and there is no error
> message and no exception, so I really have no idea what it was doing.

To the list of things repair does wrong in 0.7, we have to add this: if one
of the nodes participating in the repair (that is, any node that shares a range
with the node on which repair was started) goes down, even for a short time,
the repair will simply hang forever, doing nothing, and no specific
error message will be logged. That could be what happened. Again, recent
releases of 0.8 fix that too.
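
As a rough illustration of that failure mode (not Cassandra's actual code, and
not a description of how 0.8 fixes it), here is a minimal Python sketch of a
coordinator waiting for one response per neighbor. The function name, the
queue, and the node names are all hypothetical. With no timeout, a neighbor
that never answers leaves the wait blocked silently; with a timeout, the
missing node is at least reported.

```python
import queue


def wait_for_trees(responses, expected, timeout=None):
    """Collect one response per expected neighbor (hypothetical helper).

    With timeout=None, responses.get() blocks forever if a neighbor never
    answers, which mimics a repair that hangs without logging anything.
    """
    pending = set(expected)
    while pending:
        try:
            node = responses.get(timeout=timeout)
        except queue.Empty:
            # Only reachable when a timeout is set: name the silent nodes.
            raise TimeoutError(f"no response from: {sorted(pending)}")
        pending.discard(node)
    return True
```

Calling it with a queue that only ever receives "node2" while expecting
"node2" and "node3" raises TimeoutError when a timeout is given, and blocks
indefinitely when it is not.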


> I stopped it yesterday. Maybe I should run repair again with compaction
> disabled on all nodes?
> thanks!
> On Wed, Sep 14, 2011 at 6:57 AM, Peter Schuller wrote:
>> > I think it is a serious problem since I cannot "repair". I am
>> > using Cassandra on production servers. Is there some way to fix it
>> > without upgrading? I have heard that 0.8.x is still not quite ready
>> > for production environments.
>> It is a serious issue if you really need to repair one CF at a time.
>> However, looking at your original post it seems this is not
>> necessarily your issue. Do you need to, or was your concern rather the
>> overall time repair took?
>> There are other things that are improved in 0.8 that affect 0.7. In
>> particular, (1) in 0.7 compaction, including validating compactions
>> that are part of repair, is non-concurrent so if your repair starts
>> while there is a long-running compaction going it will have to wait,
>> and (2) semi-related is that the merkle tree calculation that is part
>> of repair/anti-entropy may happen "out of synch" if one of the nodes
>> participating happens to be busy with compaction. This in turn causes
>> additional data to be sent as part of repair.
>> That might be why your immediately following repair took a long time,
>> but it's difficult to tell.
>> If you're having issues with repair and large data sets, I would
>> generally say that upgrading to 0.8 is recommended. However, if you're
>> on 0.7.4, beware of
>> --
>> / Peter Schuller (@scode on twitter)
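
To illustrate point (2) in the quoted reply, here is a toy Python sketch of
why trees computed at different times lead to extra data being sent. It is
not Cassandra's Merkle tree implementation: it hashes a handful of key ranges
per replica and compares them, and every name in it (range_hashes,
mismatched_ranges, the bucket count) is an assumption made for the example.

```python
import hashlib


def range_hashes(rows, buckets=4):
    """Hash each key range of a replica; a stand-in for Merkle tree leaves."""
    digests = [hashlib.sha256() for _ in range(buckets)]
    for key in sorted(rows):
        # Deterministic bucket choice (the built-in hash() is salted per run).
        b = int(hashlib.sha256(key.encode()).hexdigest(), 16) % buckets
        digests[b].update(f"{key}={rows[key]};".encode())
    return [d.hexdigest() for d in digests]


def mismatched_ranges(tree_a, tree_b):
    """Return the indexes of ranges whose hashes differ between two replicas."""
    return [i for i, (x, y) in enumerate(zip(tree_a, tree_b)) if x != y]


# Both replicas hold the same rows when repair starts.
rows = {"k1": "a", "k2": "b", "k3": "c"}
tree_a = range_hashes(rows)

# The second replica is busy and hashes later, after a new write has arrived.
tree_b = range_hashes(dict(rows, k4="d"))

print(mismatched_ranges(tree_a, tree_b))
```

The single late write makes one range look different, so that whole range
would be exchanged even though both replicas receive the same write.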
