incubator-cassandra-user mailing list archives

From Philippe <watche...@gmail.com>
Subject Re: Unable to repair a node
Date Tue, 16 Aug 2011 20:48:08 GMT
I'm still trying different stuff. Here are my latest findings, maybe someone
will find them useful:

   - I have been able to repair some small column families by issuing a
   repair [KS] [CF]. When testing on the ring with no writes at all, it still
   takes about 2 repairs to get "consistent" logs for all AES requests.
   - Launching a repair on the smallest CF of the biggest KS has triggered
   a flurry of compactions and streams. Some of those streams are for other
   CFs in that keyspace!?
   - During repairs (one at a time cluster-wide), I get 25-50% I/O wait and
   35-50% CPU usage on a 6-core, SATA-disk setup
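
For anyone who wants to script the per-CF repairs described above, here is a
minimal Python sketch. It only wraps "nodetool -h <host> repair <KS> <CF>" and
runs one column family at a time. The helper names are made up for
illustration, and it assumes nodetool's exit status is meaningful, which may
not hold on 0.8, so check the logs as well.

```python
import subprocess


def build_repair_cmd(host, keyspace, column_family):
    """Return the nodetool argv that repairs a single column family."""
    return ["nodetool", "-h", host, "repair", keyspace, column_family]


def repair_sequentially(host, keyspace, column_families, run=subprocess.call):
    """Repair column families one at a time, stopping at the first failure.

    `run` is injectable so the loop can be dry-run without a cluster.
    Returns the list of column families whose repair command returned 0.
    """
    done = []
    for cf in column_families:
        rc = run(build_repair_cmd(host, keyspace, cf))
        if rc != 0:
            break  # do not pile more repairs on top of a failed one
        done.append(cf)
    return done
```

Calling repair_sequentially("localhost", "My_Keyspace", ["My_Columnfamily"])
would issue a single repair and return once nodetool exits.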

What is surprising to me (bug?) is that netstats shows me streams going from
node A to node B at 0% progress. But netstats on node B doesn't show me any
streams coming in. I suspect the repairs may never finish, and that this is
interfering with my compactions, hence the huge pile-up of compactions until
the disk fills up.
I know there's an issue related to failed streams & repairs, could I be
hitting it ?
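
The cross-check between the two netstats views can be sketched as below. This
is only an illustration: it does not parse netstats output (the format varies
by version), so the inputs are assumed to be collected by hand, and the
function name and file names are hypothetical.

```python
def stalled_streams(outgoing, incoming):
    """Find streams the sender lists at 0% that the receiver does not list.

    outgoing: dict mapping file name -> percent done, as seen on the sender.
    incoming: iterable of file names, as seen on the receiver.
    """
    seen_on_receiver = set(incoming)
    return sorted(
        name
        for name, percent in outgoing.items()
        if percent == 0 and name not in seen_on_receiver
    )


# Hypothetical example: node A lists three streams, node B lists only one.
node_a = {"cf1-g-12-Data.db": 0, "cf1-g-13-Data.db": 40, "cf2-g-7-Data.db": 0}
node_b = ["cf2-g-7-Data.db"]
suspects = stalled_streams(node_a, node_b)
```

Any name left in suspects is a stream that exists only on the sending side,
which is the situation described above.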

Thanks

2011/8/14 Philippe <watcherfr@gmail.com>

> @Teijo : thanks for the procedure, I hope I won't have to do that
>
> Peter, I'll answer inline. Thanks for the detailed answer.
>
>
>> > the number of SSTables for some keyspaces goes dramatically up (from 3
>> > or 4 to several dozen).
>>
>> Typically with a long running compaction, such as that triggered by
>> repair, that's what happens as flushed memtables accumulate. In
>> particular for memtables with frequent flushes.
>>
>> Are you running with concurrent compaction enabled?
>>
> Yes, it is enabled. On my 0.8 cluster, cassandra.yaml has this (it's
> commented). BTW, I have 6 cores on each server.
>
> #concurrent_compactors: 1
>
>> > the commit log keeps increasing in size, I'm at 4.3G now, it went up to
>> > 40G when the compaction was throttled at 16MB/s. On the other nodes it's
>> > around 1GB at most
>> Hmmmm. The Commit Log should not be retained longer than what is
>> required for memtables to be flushed. Is it possible you have had an
>> out-of-disk condition and flushing has stalled? Are you seeing flushes
>> happening in the log?
>>
> No I don't believe there was ever an out of disk.  Yes it is flushing for
> the first couple of hours.
> Then, when repair seems locked up, my log is mostly filled with lines such
> as this
> INFO [ScheduledTasks:1] 2011-08-14 23:15:47,267 StatusLogger.java (line 88)
> [My_Keyspace].[My_Columnfamily]           45,105541               50/50
>           20/20
>  Why is that ?
>
>> > the data directory is bigger than on the other nodes. I've seen it go up
>> > to 480GB when the compaction was throttled at 16MB/s
>> How much data are you writing? Is it at all plausible that the huge
>> spike is a reflection of lots of overwriting writes that aren't being
>> compacted?
>>
> No, there's no bulk loading going on at the moment and I'm pretty sure
> there wasn't when it spiked up to that load.
> I've never measured the load because it's a mix of counter increments and
> new counters all the time. It's not that much though.
>
>
>> Normally when disk space spikes with repair it's due to other nodes
>> streaming huge amounts (maybe all of their data) to the node, leading
>> to a temporary spike. But if your "real" size is expected to be 60,
>> 480 sounds excessive. Are you sure other nodes aren't running repairs
>> at the same time and magnifying each other's data load spikes?
>>
> Yes, the two other nodes were running repairs. I had them scheduled at 8
> hour intervals but they must have started while this one was still running.
> When data is streamed from one to another, does that data go into the
> commit log as a regular write ?
>  How much of a negative impact can that have on the repair going on on this
> node ?
>
>> > What's even weirder is that currently I have 9 compactions running but
>> > CPU is throttled at 1/number of cores half the time (while > 80% the rest
>> > of the time). Could this be because other repairs are happening in the
>> > ring ?
>> You mean compaction is taking less CPU than it "should"?
>>
> Yes
>
>
>> No, this should not be due to other nodes repairing. However it sounds
>> to me like you are bottlenecking on I/O and the repairs and
>>
> Yes, I/O is really high on the node right now. Around 50% I/O waits.
>
>
>> compactions are probably proceeding extremely slowly, probably being
>> completely drowned out by live traffic (which is probably having an
>> abnormally high performance impact due to data size spike).
>>
> Yes, the live traffic is 3 to 10 times slower during repair. Ouch... I
> hope I won't have to do this too often while in production !
>
>
>>
>> What's your read concurrency configured on the node? What does "iostat
>> -x -k 1" show in the average queue size column?
>
> Average queue size on the disk (RAID-1 + separate LVM volumes for data,
> commit log, caches, logs) varies between 2 and 90. I'd say the average is
> around 30-40. Very high variation.
>
>
>> Is "nodetool -h
>> localhost tpstats" showing that ReadStage is usually "full" (@ your
>> limit)?
>>
> No backlog at all in tpstats
>
> I've figured out how AES is logging its actions and it looks like it really
> is going through every CF in every keyspace and doing a tree request for
> every token range
> So it really looks like it's just taking forever to compact stuff as it's
> repairing.
> I saw in another email that repairing was taking 2-3 min/GB... it looks like
> a lot more for my ring. Anybody else have numbers ?
>
> Thanks
>
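
For reference, the 2-3 min/GB figure quoted above can be turned into a rough
expected duration. This is back-of-the-envelope arithmetic only. It assumes
the "60" mentioned earlier in the thread is in GB, and the function name is
made up for illustration.

```python
def repair_minutes(data_gb, min_per_gb_low=2.0, min_per_gb_high=3.0):
    """Return a (low, high) estimate in minutes for repairing data_gb of data."""
    return (data_gb * min_per_gb_low, data_gb * min_per_gb_high)


low, high = repair_minutes(60)  # 120.0 to 180.0 minutes, i.e. 2 to 3 hours
```

A repair that runs far longer than that range on a node of this size would
suggest it is stuck rather than just slow.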
