cassandra-user mailing list archives

From aaron morton <>
Subject Re: balance between concurrent_[reads|writes] and feeding/reading threads in clients
Date Mon, 04 Apr 2011 11:29:58 GMT
What do your TP stats look like under load? Are you actually using the 100 read/write threads? What is the IO platform, what sort of load is it under, and how many cores do the machines have? It's interesting that you seem to be having a better time with such high values.

If you are overloading the cluster to the point where it cannot perform the requested operation (at the CL) within rpc_timeout, it will raise a TimedOutException through the API. In this case the coordinator sent messages to nodes, and some of them may have completed the request, so the state of the request is unknown.

But this temporary overloading may only apply to a subset of the cluster, and so may only affect a small fraction of the requests. E.g. with 10 nodes and RF 3, if nodes 2 and 3 are under heavy compaction or repair load and start dropping messages, that may result in TimedOutExceptions for the token range from nodes 1 and 2. All other parts of the ring could continue to work, so it's difficult to say the cluster as a whole is overloaded.

It also makes sense for the read/write TPs to be so large in Cassandra: when an (internal) message is received, the TCP connection thread places the message into the TP for the target stage and then processes the next message on the socket. Any further processing of the message or local machine state is left to the target thread pool.
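That hand-off can be sketched roughly as follows; the class and stage names here are illustrative, not Cassandra's actual internals:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the stage hand-off described above: the
// connection thread only enqueues each inbound message into the target
// stage's pool and immediately goes back to reading the socket.
public class StageHandoff {
    // One executor per stage, e.g. a read stage sized by concurrent_reads.
    static final ExecutorService READ_STAGE = Executors.newFixedThreadPool(8);

    // Called by the TCP connection thread for every decoded message.
    static Future<?> deliver(Runnable message) {
        // Enqueue and return at once; all further work (including any
        // inspection of local state) happens on a stage thread.
        return READ_STAGE.submit(message);
    }

    public static void main(String[] args) throws Exception {
        Future<?> f = deliver(() ->
                System.out.println("processed on " + Thread.currentThread().getName()));
        f.get();               // wait so the demo exits cleanly
        READ_STAGE.shutdown();
    }
}
```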


On 1 Apr 2011, at 23:42, Terje Marthinussen wrote:

> The reason I am asking is obviously that we saw a bunch of stability issues for a while. We had some periods with a lot of dropped messages, but also a bunch of dead/UP messages without drops (followed by hinted handoffs) and loads of read repairs.
> This all seems to work a lot better after increasing to 100 read/mutation threads per node (12 node cluster).
> I am not entirely sure this is a "just forget" matter. I would much prefer that Cassandra was able to somehow tell the client "this is too fast! Please slow down" instead of having an infinitely large queue. Queuing data rarely gets you out of trouble (it only gets you into more), so queues should in my opinion never get longer than what is needed to smooth out peaks in traffic.
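For contrast, the bounded-queue behaviour Terje is asking for could look something like the sketch below. This is hypothetical, not what Cassandra does: a small bounded work queue that rejects overflow so the caller can tell the client to back off.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of a stage pool whose queue only smooths short bursts instead
// of growing without limit. All sizes here are illustrative.
public class BoundedStage {
    static ThreadPoolExecutor pool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(4),            // small, bounded queue
            new ThreadPoolExecutor.AbortPolicy());  // overflow -> RejectedExecutionException

    // Returns false when the node is saturated; a transport layer could
    // map this to an "overloaded, please slow down" error to the client.
    static boolean tryAccept(Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException overloaded) {
            return false;
        }
    }
}
```

With 2 worker threads and a queue of 4, the 7th concurrent task is rejected immediately rather than queued behind an unbounded backlog.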
> It may also be sane to add some thread pool allocation or priority handling for reads/writes coming in through internal communication, to avoid triggering hints or read repairs without good reason.
> I guess I could play around with the throttle limit in the round robin scheduler for this.
> Regards, 
> Terje
> On Tue, Mar 29, 2011 at 7:42 PM, aaron morton <> wrote:
> The concurrent_reads and concurrent_writes settings set the number of threads in the relevant thread pools. You can view the number of active and queued tasks using nodetool tpstats.
> The thread pool uses a blocking linked list for its work queue, with a max size of Integer.MAX_VALUE, so its size is essentially unbounded. When (internode) messages are received by a node they are queued into the relevant thread pool for processing. When (certain) messages are executed, the thread checks the send time of the message and will not process it if it is more than rpc_timeout (typically 10 seconds) old. This is where the "dropped messages" log messages come from.
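That staleness check can be sketched as follows (names and structure are illustrative, not Cassandra's actual code):

```java
// When a stage thread finally dequeues a message, it compares the send
// timestamp against rpc_timeout and silently drops anything already too
// old to matter: the coordinator has long since given up waiting on it.
public class DropCheck {
    static final long RPC_TIMEOUT_MS = 10_000;   // typical default rpc_timeout

    // Returns true if the message should still be processed.
    static boolean stillFresh(long sentAtMs, long nowMs) {
        return (nowMs - sentAtMs) <= RPC_TIMEOUT_MS;
    }
}
```

A message that sat in the queue for 15 seconds is dropped (and counted in the "dropped messages" log line), while one 5 seconds old is processed normally.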

> The coordinator will wait up to rpc_timeout for the CL number of nodes to respond. So if, say, one node is under severe load and cannot process the read in time but the others are OK, a request at QUORUM would probably succeed. However, if a number of nodes are taking a beating, the coordinator may time out, resulting in the client getting a TimedOutException.

> For the read path it's a little more touchy. Only the "nearest" node is sent a request for the actual data; the others are asked for a digest of the data they would return. So if the "nearest" node is the one under load and times out, the request will time out even if CL nodes returned. That's what the DynamicSnitch is there for: a node under load would be less likely to be considered the "nearest" node.
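The coordinator's wait-for-CL-responses behaviour can be approximated with a latch. This is a simplification of the real code; the names are illustrative:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Rough sketch of the coordinator wait: block until CL responses have
// arrived, or rpc_timeout elapses, in which case the client would see
// a TimedOutException.
public class QuorumWait {
    // responses is initialised to the consistency level (e.g. 2 of 3 for
    // QUORUM at RF 3); each replica response counts it down.
    static boolean await(CountDownLatch responses, long timeoutMs)
            throws InterruptedException {
        return responses.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

If two of three replicas respond in time, the QUORUM wait succeeds even though the third never answered; if fewer than CL respond within the timeout, it fails.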
> The read and write thread pools really just deal with reading and writing data on the local machine. Your request moves through several other threads / thread pools: the connection thread, the outbound TCP pool, the inbound TCP pool and the message response pool. The SEDA paper referenced on this page was the model for using thread pools to manage access to resources.
> In summary, don't worry about it unless you see the thread pools backing up and messages being dropped.
> Hope that helps
> Aaron
> On 28 Mar 2011, at 19:55, Terje Marthinussen wrote:
>> Hi, 
>> I was pondering how the concurrent_read and write settings balance against the max read/write threads in clients.
>> Let's say we have 3 nodes, and concurrent read/write set to 8.
>> That is, 8*3=24 threads for reading and writing.
>> Replication factor is 3.
>> Let's say we have clients that in total set up 16 connections to each node.
>> Now all the clients write at the same time. Since the replication factor is 3, you could get up to 16*3=48 concurrent write requests per node (which need to be handled by 8 threads).
>> What is the result if this load continues?
>> Could you see replication of data failing (at least initially), causing all kinds of fun timeouts around the system?
>> Same on the read side. 
>> If all clients read at the same time with consistency level QUORUM, you get 16*2 read requests in the best case (and more in the worst case)?
>> Could you see that one node answers but another times out due to lack of read threads, causing read repair which further degrades things?
>> How does this queue up internally between Thrift, gossip and the threads doing the actual reads and writes?
>> Regards,
>> Terje
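Terje's fan-in arithmetic above can be made explicit with a tiny helper (illustrative only):

```java
// Worked version of the arithmetic in the question: with RF 3, every
// client write fans out to 3 replicas, so the concurrent writes arriving
// at each node can far exceed the concurrent_writes thread count.
public class FanIn {
    static int perNodeWrites(int connectionsPerNode, int replicationFactor) {
        return connectionsPerNode * replicationFactor;
    }

    public static void main(String[] args) {
        // 16 client connections per node, RF 3 -> up to 48 concurrent
        // writes per node, queued behind only 8 concurrent_writes threads.
        System.out.println(perNodeWrites(16, 3));
    }
}
```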
