cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From aaron morton <aa...@thelastpickle.com>
Subject Re: One hot node slows down whole cluster
Date Wed, 17 Aug 2011 23:27:55 GMT
wrt the Exception something has shutdown the Mutation thread pool. The only thing I can see
in the code to do this is nodetool drain and running the Embedded server. If it was drain
you should see an INFO level messages "Node is drained" somewhere. Could either of these things
be happening ? 

wrt the slow down:
- what CL are you using  for reads and writes ? What does the ring look like ? 
- have a look at tp stats to see what stage backing up 
- ensure you have the dynamic snitch enabled
- what setting do you have for dynamic_snitch_badness_threshold in yaml
- have a look at the o.a.c.DynamicEndpointSnitch info in JMX / JConsole  at dumpTimings()
and scores

Basically slower nodes should be used less. But there are reasons they may not be, so lets
work out what requests are running slow and if the Dynamic Snitch is doing the right thing.
I would look at that error first, seems odd.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 18/08/2011, at 6:52 AM, Hefeng Yuan wrote:

> Just wondering, would it help if we shorten the rpc_timeout_in_ms (currently using 30,000),
so that when one node gets hot and responding slowly, others will just take it as down and
move forward?
> 
> On Aug 17, 2011, at 11:35 AM, Hefeng Yuan wrote:
> 
>> Sorry, correction, we're using 0.8.1.
>> 
>> On Aug 17, 2011, at 11:24 AM, Hefeng Yuan wrote:
>> 
>>> Hi,
>>> 
>>> We're noticing that when one node gets hot (very high cpu usage) because of 'nodetool
repair', the whole cluster's performance becomes really bad.
>>> 
>>> We're using 0.8.1 with random partition. We have 6 nodes with RF 5. Our repair
is scheduled to run once a week, spread across whole cluster. I do get suggestion from Jonothan
that 0.8.0 has some bug on the repair, but wondering why one hot node would slow down the
whole cluster.
>>> 
>>> We saw this symptom yesterday on one node, and today on the adjacent node. Most
probably it'll happen on the next one tomorrow.
>>> 
>>> We do see lots of (~200) RejectedExecutionException 3 hours before the repair
job, and also in the middle of the repair job, not sure whether they're related. Full stack
is attached in the end.
>>> 
>>> Do we have any suggestion/hint?
>>> 
>>> Thanks,
>>> Hefeng
>>> 
>>> 
>>> ERROR [pool-2-thread-3097] 2011-08-17 08:42:38,118 Cassandra.java (line 3462)
Internal error processing batch_mutate
>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
down
>>> 	at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>> 	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>> 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>> 	at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360)
>>> 	at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241)
>>> 	at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62)
>>> 	at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99)
>>> 	at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210)
>>> 	at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154)
>>> 	at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560)
>>> 	at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:511)
>>> 	at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:519)
>>> 	at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
>>> 	at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
>>> 	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>> 	at java.lang.Thread.run(Thread.java:619)
>>> ERROR [Thread-137480] 2011-08-17 08:42:38,121 AbstractCassandraDaemon.java (line
113) Fatal exception in thread Thread[Thread-137480,5,main]
>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
down
>>> 	at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>> 	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>> 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>> 	at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:444)
>>> 	at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>> 
> 


Mime
View raw message