incubator-cassandra-user mailing list archives

From: Maxim Potekhin <>
Subject: Re: RMI/JMX errors, weird
Date: Tue, 24 Apr 2012 18:46:10 GMT
Hello Aaron,

It was probably the over-optimistic number of concurrent compactors that
was tripping up the system.

I do not entirely understand what the correlation is here; maybe the
compactors were overloading the neighboring nodes and causing time-outs.
I tuned the concurrency down and after a while things seem to have
settled down. Thanks for the suggestion.
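
For reference, the knob in question is concurrent_compactors in
cassandra.yaml. The snippet below is only an illustration of the kind of
change involved, not a recommendation for every cluster (the right value
depends on cores and disk bandwidth):

    # cassandra.yaml
    # Cap the number of simultaneous compaction threads; the default
    # scales with the number of cores, which was too aggressive here.
    concurrent_compactors: 2

The node has to be restarted for the yaml change to take effect.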


On 4/19/2012 4:13 PM, aaron morton wrote:
>> 1150 pending tasks, and is not
>> making progress.
> Not all pending tasks reported by nodetool compactionstats actually
> run. Once they get a chance to run, the files they were going to work
> on may have already been compacted.
> Given that repair tests at double the phi threshold, it may not make 
> much difference.
> Did other nodes notice it was dead? Was there anything in the log
> that showed it was under duress (GC or dropped message logs)?
> Is the compaction a consequence of repair? (The streaming stage can
> result in compactions.) Or do you think the node is just behind on
> compactions?
> If you feel compaction is hurting the node, consider
> setting concurrent_compactors in the yaml to 2.
> You can also isolate the node from updates using nodetool
> disablegossip and disablethrift, and then turn off the IO limiter
> with nodetool setcompactionthroughput 0.
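> i.e. roughly, on the affected node (enablegossip / enablethrift
> reverse this afterwards):
>
>     nodetool disablegossip
>     nodetool disablethrift
>     nodetool setcompactionthroughput 0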
> Hope that helps.
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> On 20/04/2012, at 12:29 AM, Maxim Potekhin wrote:
>> Hello Aaron,
>> how should I go about fixing that? Also, after a repeated attempt to
>> compact, it goes again into "building secondary index" with 1150
>> pending tasks, and is not making progress. I suspected a disk system
>> failure, but this needs to be confirmed.
>> So basically, do I need to tune the phi threshold up? The thing is,
>> there was no heavy load on the cluster at all.
>> Thanks
>> Maxim
>> On 4/19/2012 7:06 AM, aaron morton wrote:
>>> At some point the gossip system on the node this log is from decided
>>> that was DOWN. This was based on how often the node
>>> was gossiping to the cluster.
>>> The active repair session was informed, and to avoid failing the job
>>> unnecessarily it tested that the errant node's phi value was twice
>>> the configured phi_convict_threshold. It was, and the repair was killed.
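>>> (For a sense of scale: with the default phi_convict_threshold of 8,
>>> the repair session is only abandoned once the endpoint's phi passes 16.)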
>>> Take a look at the logs on and see if anything was
>>> happening on the node at the same time. Could be GC or an
>>> overloaded node (it would log about dropped messages).
>>> Perhaps other nodes also saw as down? It only needed
>>> to be down for a few seconds.
>>> Hope that helps.
>>> -----------------
>>> Aaron Morton
