cassandra-user mailing list archives

From Daniel Doubleday <>
Subject Re: repair question
Date Wed, 25 May 2011 08:06:27 GMT
Ok - obviously these haven't been my brightest days.

The stream request sent to the neighbors doesn't contain the CF for which the ranges have
been determined to mismatch.
So every diff in every CF will result in getting that range from every CF of the neighbor.
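
As a rough illustration of that amplification, here is a minimal sketch. It is not Cassandra code: the function name, the range ids and the per-range sizes are all hypothetical, and it assumes each CF holds a uniform number of bytes per range. It compares a request that names the mismatching CF with one that does not.

```python
def bytes_requested(mismatched_ranges, bytes_per_range, scoped_to_cf):
    """Estimate bytes a repairing node asks one neighbor to stream.

    mismatched_ranges: CF name -> set of range ids whose tree hashes differ
    bytes_per_range:   CF name -> bytes held per range (assumed uniform)
    scoped_to_cf:      True if the request names the CF; False models the
                       behaviour described above (range fetched from every CF)
    """
    total = 0
    for cf, ranges in mismatched_ranges.items():
        # Without the CF in the request, every CF answers for the range.
        targets = [cf] if scoped_to_cf else list(bytes_per_range)
        per_range = sum(bytes_per_range[t] for t in targets)
        total += len(ranges) * per_range
    return total


# Hypothetical numbers: one large CF and one small CF.
mismatched = {"BlobStore": {1, 2}, "Rooms": {2, 3, 4}}
size = {"BlobStore": 100, "Rooms": 1}

scoped = bytes_requested(mismatched, size, scoped_to_cf=True)
unscoped = bytes_requested(mismatched, size, scoped_to_cf=False)
print(scoped, unscoped)
```

With these made-up inputs the unscoped request is dominated by the large CF, even for ranges where only the small CF differed.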

That explains everything.

So I guess my next repair will be scheduled in 0.8.1.

But I don't understand why this did not hit others so hard that it is considered more critical.
We seem to use cassandra in unusual ways.

Thanks again.


On May 24, 2011, at 9:05 PM, Daniel Doubleday wrote:

> Ok, thanks for your help Sylvain - much appreciated.
> In short: I believe that most of this is me not looking clearly yesterday. There are
> only one or two points that I don't get.
> Maybe you could help me out there.
> First, the ~500MB thing is BS. The closer neighbors received around 80G and the other
> 2 around 40G.
> Sorry about that, but I got your attention :-)
> My missing pieces are:
> 1. Why was I running out of space? I checked again and found that I started with 761G
> free disc space.
> To make it simple I will only look at one CF, 'BlobStore', which is the evil large one
> and accounts for 80%.
> I grepped for the streaming metadata in the log and summed it up: total streaming file
> size was 279G.
> This comes as a real surprise but still ...
> 2. The file access times are strange: why does the node receive data before differencing
> has finished?
> On the repairing node I see that the first differencing for that one ended at 13:02:
> grep streaming /var/log/cassandra/system.log
> ....
>  INFO [AntiEntropyStage:1] 2011-05-23 13:02:52,990 (line 491)
> Performing streaming repair of 2088 ranges for #<TreeRequest manual-repair-ab469cff-98fb-46fa-9ad4-476a77860ed8,
> /, (Smeet,ObjectRepository)>
> ....
> A listing I did on that node in the data dir shows that data files arrive much earlier:
> ls -al *tmp*
> ...
> -rw-r--r-- 1 cass cass   146846246 May 23 12:14 BlobStore-tmp-f-16356-Data.db
> -rw-r--r-- 1 cass cass      701291 May 23 12:14 BlobStore-tmp-f-16357-Data.db
> -rw-r--r-- 1 cass cass     6628735 May 23 12:14 BlobStore-tmp-f-16358-Data.db
> -rw-r--r-- 1 cass cass        9991 May 23 12:14 BlobStore-tmp-f-16359-Data.db
> ...
> The youngest file for every CF was written at 12:14, which is the time the first differencing
> ended:
>  INFO [AntiEntropyStage:1] 2011-05-23 12:14:36,255 (line 491)
> Performing streaming repair of 71 ranges for #<TreeRequest manual-repair-ab469cff-98fb-46fa-9ad4-476a77860ed8,
> /, (Smeet,Rooms
> I thought that Cassandra would stream directly from the sstables without tmp files and
> that these are the files received from the other nodes?
> 3. That's only loosely related, but how could a repairing node ever receive data that
> is not requested because of a merkle tree diff?
> If you look at Only one tree request
> was generated but still the repairing node got all that data from the other CFs.
> That's in fact one of the reasons why I thought that there might be a bug that sends
> too much data in the first place.
> Thanks for reading this book
> Daniel
> If you're interested, here's the log:
> I also lied about total size of one node. It wasn't 320 but 280. All nodes 
> On May 24, 2011, at 3:41 PM, Sylvain Lebresne wrote:
>> On Tue, May 24, 2011 at 12:40 AM, Daniel Doubleday
>> <> wrote:
>>> We are performing the repair on one node only. Other nodes receive reasonable
>>> amounts of data (~500MB).  It's only the repairing node itself which 'explodes'.
>> That, for instance, is a bit weird. That the node on which the repair
>> is performed gets more data is expected, since it is repaired against all
>> its "neighbors" while the neighbors themselves get repaired only
>> against that given node. But when differences between two nodes A and B are
>> computed, the ranges to repair are streamed both from A to B and from
>> B to A. Unless A and B are widely out of sync (like A has no data and
>> B has tons of it), around the same amount of data should transit in
>> both directions. So with RF=3, the node on which repair was started should get
>> around 4 times (up to 6 times if you have a weird topology) as much data
>> as any neighboring node, but that's it. While if I'm correct, you
>> are reporting that the neighboring node gets ~500MB and the
>> "coordinator" gets > 700GB ?!
>> Honestly I'm not sure an imprecision of the merkle tree could account
>> for that behavior.
>> Anyway, Daniel, would you be able to share the logs of the nodes (at
>> least the node on which repair is started) ? I'm not sure how much
>> that could help but that cannot hurt.
>> --
>> Sylvain
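
The expectation Sylvain describes can be put into numbers with a small sketch. It assumes a simple ring in which a node shares ranges with RF-1 predecessors and RF-1 successors (which gives the 4 neighbors mentioned for RF=3) and that each pair exchanges about the same amount in both directions. The function name and the formula are illustrative assumptions, not taken from Cassandra.

```python
def expected_received(replication_factor, bytes_per_pair):
    """Return (coordinator_bytes, per_neighbor_bytes) for one repair.

    Assumption: simple ring topology, so the repairing node has
    2 * (RF - 1) neighbors, and each pairwise exchange streams roughly
    bytes_per_pair in each direction.
    """
    neighbors = 2 * (replication_factor - 1)
    coordinator = neighbors * bytes_per_pair  # receives from every neighbor
    per_neighbor = bytes_per_pair             # receives from the coordinator only
    return coordinator, per_neighbor


# With the ~500MB per neighbor first reported in the thread (in MB):
coordinator_mb, neighbor_mb = expected_received(3, 500)
print(coordinator_mb, neighbor_mb)
```

Under these assumptions the repairing node would be expected to receive on the order of 2GB, which is why a figure in the hundreds of GB looked wrong.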
>>> I must admit that I'm a noob when it comes to aes/repair. It's just strange that
>>> a cluster that is up and running with no probs is doing that. But I understand that it's not
>>> supposed to do what it's doing. I just hope that I find out why soon enough.
>>> On 23.05.2011, at 21:21, Peter Schuller <> wrote:
>>>>> I'm a bit lost: I tried a repair yesterday with only one CF and that
>>>>> didn't really work the way I expected but I thought that would be a bug which only affects
>>>>> that special case.
>>>>> So I tried again for all CFs.
>>>>> I started with a nicely compacted machine with around 320GB of load.
>>>>> Total disc space on this node was 1.1TB.
>>>> Did you do repairs simultaneously on all nodes?
>>>> I have seen very significant disk space increases under some
>>>> circumstances. While I haven't filed a ticket about it because there
>>>> was never time to confirm, I believe two things were at play:
>>>> (1) nodes were sufficiently out of sync in a sufficiently spread out
>>>> fashion that the granularity of the merkle tree (IIRC, and if I read
>>>> correctly, it divides the ring into up to 2^15 segments but no more)
>>>> became ineffective so that repair effectively had to transfer all the
>>>> data. At first I thought there was an outright bug, but after looking
>>>> at the code I suspected it was just the merkle tree granularity.
>>>> (2) I suspected at the time that a contributing factor was also that
>>>> as one repair might cause a node to significantly increase its live
>>>> sstables temporarily until they are compacted, another repair on
>>>> another node may begin and start a validating compaction and streaming
>>>> of that data - leading to disk space bloat essentially being
>>>> "contagious"; the third node, streaming from the node that was
>>>> temporarily bloated, will receive even more data from that node than
>>>> it normally would.
>>>> We're making sure to only run one repair at a time between any hosts
>>>> that are neighbors of each other (meaning that at RF=3, that's 1
>>>> concurrent repair per 6 nodes in the cluster).
>>>> I'd be interested in hearing anyone confirm or deny whether my
>>>> understanding of (1) in particular is correct. To connect it to
>>>> reality: a 20 GB CF divided into 2^15 segments implies each segment is
>>>> more than 600 kbyte in size. For CFs with tens or hundreds of millions of
>>>> small rows and a fairly random (with respect to partitioner) update
>>>> pattern, it's not very difficult to end up in a situation where most
>>>> 600 kbyte chunks contain out-of-sync data. Particularly in a
>>>> situation with lots of dropped messages.
>>>> I'm getting the 2^15 from AntiEntropyService.Validator.Validator()
>>>> which passes a maxsize of 2^15 to the MerkleTree constructor.
>>>> --
>>>> / Peter Schuller
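
Peter's granularity argument can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope model: it takes the 2^15 segment cap and the 20 GB CF from his message, and additionally assumes (this is not from the thread) that out-of-sync rows land in segments uniformly at random. The row count of 100,000 is a hypothetical input.

```python
MAX_SEGMENTS = 2 ** 15  # segment cap quoted in the message above


def segment_size_bytes(cf_bytes, segments=MAX_SEGMENTS):
    """Average amount of data covered by one tree segment."""
    return cf_bytes / segments


def expected_dirty_fraction(out_of_sync_rows, segments=MAX_SEGMENTS):
    """Expected fraction of segments containing at least one stale row.

    Assumes rows fall into segments uniformly at random, so a given
    segment stays clean with probability (1 - 1/segments) ** rows.
    """
    return 1.0 - (1.0 - 1.0 / segments) ** out_of_sync_rows


seg = segment_size_bytes(20 * 10**9)      # 20 GB CF, decimal gigabytes
frac = expected_dirty_fraction(100_000)   # hypothetical number of stale rows
print(f"{seg / 1000:.0f} kB per segment, {frac:.0%} of segments differ")
```

Under this model a comparatively small number of scattered stale rows is enough to mark most segments as different, at which point repair streams nearly the whole CF.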
