incubator-cassandra-user mailing list archives

From Hefeng Yuan <hfy...@rhapsody.com>
Subject Re: One hot node slows down whole cluster
Date Wed, 17 Aug 2011 23:44:34 GMT
Thanks, Aaron, for the response. We're not running drain on any node, and that message doesn't appear in the log.

We're using LOCAL_QUORUM CL, with the following snitch settings:
endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
dynamic_snitch: false
dynamic_snitch_badness_threshold: 0.0
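
For what it's worth, re-enabling the dynamic snitch per your suggestion would presumably look like this in cassandra.yaml (the 0.1 threshold below is an assumed starting value, not something we've tested):

  endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
  dynamic_snitch: true
  # only route around the closest replica once its score is 10% worse than the alternatives
  dynamic_snitch_badness_threshold: 0.1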

We also have another 3-node DC for Brisk; could that be the problem?


On Aug 17, 2011, at 4:27 PM, aaron morton wrote:

> wrt the Exception: something has shut down the Mutation thread pool. The only thing I can see in the code that does this is nodetool drain or running the embedded server. If it was drain, you should see an INFO-level message "Node is drained" somewhere. Could either of these things be happening?
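> A quick way to check for that (assuming the default packaged log location):
> 
>     grep -i "drained" /var/log/cassandra/system.log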
> 
> wrt the slowdown:
> - what CL are you using for reads and writes? What does the ring look like?
> - have a look at tpstats to see which stage is backing up
> - ensure you have the dynamic snitch enabled
> - what setting do you have for dynamic_snitch_badness_threshold in the yaml?
> - have a look at the o.a.c.DynamicEndpointSnitch info in JMX / JConsole, at dumpTimings() and the scores (example commands below)
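> 
> For example (a rough sketch; adjust the host and JMX port for your setup):
> 
>     nodetool -h <node-host> tpstats    # pending/blocked counts per stage
>     nodetool -h <node-host> ring       # token ownership and load per node
>     jconsole <node-host>:7199          # inspect org.apache.cassandra.db:type=DynamicEndpointSnitch,
>                                        # invoke dumpTimings(<host>) and read the Scores attribute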
> 
> Basically, slower nodes should be used less. But there are reasons they may not be, so let's work out which requests are running slow and whether the Dynamic Snitch is doing the right thing. I would look at that error first; it seems odd.
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 18/08/2011, at 6:52 AM, Hefeng Yuan wrote:
> 
>> Just wondering: would it help if we shortened the rpc_timeout_in_ms (currently 30,000), so that when one node gets hot and responds slowly, the others will just treat it as down and move on?
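>> 
>> E.g. in cassandra.yaml (the 10000 below is purely illustrative, not a tested value):
>> 
>>     # time out requests after 10s instead of 30s so one slow replica doesn't hold everything up
>>     rpc_timeout_in_ms: 10000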
>> 
>> On Aug 17, 2011, at 11:35 AM, Hefeng Yuan wrote:
>> 
>>> Sorry, correction, we're using 0.8.1.
>>> 
>>> On Aug 17, 2011, at 11:24 AM, Hefeng Yuan wrote:
>>> 
>>>> Hi,
>>>> 
>>>> We're noticing that when one node gets hot (very high CPU usage) because of 'nodetool repair', the whole cluster's performance becomes really bad.
>>>> 
>>>> We're using 0.8.1 with the RandomPartitioner. We have 6 nodes with RF 5. Our repair is scheduled to run once a week, spread across the whole cluster. I did get a suggestion from Jonathan that 0.8.0 has a bug in repair, but I'm wondering why one hot node would slow down the whole cluster.
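>>>> 
>>>> Roughly, the schedule amounts to a staggered cron entry per node, along these lines (the times and hosts here are just placeholders):
>>>> 
>>>>     # node 1: repair every Monday at 02:00; node 2 runs Tuesday, node 3 Wednesday, etc.
>>>>     0 2 * * 1 nodetool -h localhost repair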
>>>> 
>>>> We saw this symptom yesterday on one node, and today on the adjacent node. Most probably it'll happen on the next one tomorrow.
>>>> 
>>>> We do see lots of (~200) RejectedExecutionExceptions starting 3 hours before the repair job, and also in the middle of the repair job; we're not sure whether they're related. The full stack trace is attached at the end.
>>>> 
>>>> Do we have any suggestion/hint?
>>>> 
>>>> Thanks,
>>>> Hefeng
>>>> 
>>>> 
>>>> ERROR [pool-2-thread-3097] 2011-08-17 08:42:38,118 Cassandra.java (line 3462) Internal error processing batch_mutate
>>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>> 	at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>> 	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>> 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>> 	at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360)
>>>> 	at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241)
>>>> 	at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62)
>>>> 	at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99)
>>>> 	at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210)
>>>> 	at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154)
>>>> 	at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560)
>>>> 	at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:511)
>>>> 	at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:519)
>>>> 	at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
>>>> 	at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
>>>> 	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
>>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>> 	at java.lang.Thread.run(Thread.java:619)
>>>> ERROR [Thread-137480] 2011-08-17 08:42:38,121 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-137480,5,main]
>>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>> 	at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>> 	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>> 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>> 	at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:444)
>>>> 	at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>>> 
>> 
> 

