What do your TP Stats look like under load? Are you actually using the 100 read/write threads ? What is the IO platform and what sort of load is that under and how many cores do the machines have ? It's interested that you seem to having a better time with such high values.
If you are overloading the cluster to the point where it cannot perform the requested operation (at the CL) within rpc_timeout it will raise aTimedOutException through the API. In this case the coordinator sent messages to nodes, and some of them may have completed the request. So the state of the request is unknown.
But this temporary overloading may only apply in a subset of the cluster, and so may only affect a small amount of the requests. e.g.10 nodes and RF 3, if say nodes 2 and 3 under heavy compaction or repair load and started dropping messages, it may result in TimedOutExceptions for the token range from node 1 and 2. All parts of the ring would could continue to work, so it's difficult for the cluster as a whole to say it's overloaded.
It also makes sense for the read/write TP's to be so large in cassandra as when the (internal) message is received the TCP connection thread places the message into TP for the target stage and then processes the next message on the socket. Any further processing of the message or local machine state is left for the target thread pool.
On 1 Apr 2011, at 23:42, Terje Marthinussen wrote:
The reason I am asking is obviously that we saw a bunch of stability issues for a while.
We had some periods with a lot of dropped messages, but also a bunch of dead/UP messages without drops (followed by hintedhandoffs) and loads of read repairs.
This all seems to work a lot better after increasing to 100 read/mutation threads per node (12 node cluster).
I am not entirely this is a "just forget" matter. I would much prefer that cassandra was able to somehow tell the client that "this is too fast! Please slow down" instead of having an infinitely large queue. Queuing data rarely get you out of trouble (only gets you into more), so queues should in my opinion never get longer than what is needed to smoothen out peaks in traffic.
It may also sane to make some thread pool allocation or priority handling for read/writes coming in through internal communication to avoid triggering hints or read repairs without good reason.
I guess I could play around with throttle limit in the roundrobin schdeduler for this.
On Tue, Mar 29, 2011 at 7:42 PM, aaron morton <firstname.lastname@example.org>
The concurrent_reads and concurrent_writes set the number of threads in the relevant thread pools. You can view the number of active and queued tasks using nodetool tpstats.
The thread pool uses a blocking linked list for it's work queue with a max size of Integer.MAX_VALUE. So it's size is essentially unbounded. When (internode) messages are received by a node they are queued into the relevant thread pool for processing. When (certain) messages are executed it checks the send time of the message and will not process it if it is more than rpc_timeout (typically 10 seconds) old. This is where the "dropped messages" log messages come from.
The coordinator will wait up to rpc_timeout for the CL number of nodes to respond. So if say one node is under severe load and cannot process the read in time, but the others are ok a request at QUORUM would probably succeed. However if a number of nodes are getting a beating the co-ordinator may time out resulting in the client getting a TimedOutException.
For the read path it's a little more touchy. Only the "nearest" node is sent a request for the actual data, the others are asked for a digest of the data they would return. So if the "nearest" node is the one under load and times out the request will time out even if CL nodes returned. Thats what the DynamicSnitch is there for, a node under load would less likely to be considered the "nearest" node.
The read and write thread pools are really just dealing with reading and writing data on the local machine. Your request moves through several other threads / thread pools: connection thread, outbound TCP pool, inbound TCP pool and message response pool. The SEDA paper referenced on this page was the model for using thread pools to manage access to resources http://wiki.apache.org/cassandra/ArchitectureInternals
In summary, don't worry about it unless you see the thread pools backing up and messages being dropped.
Hope that helps
On 28 Mar 2011, at 19:55, Terje Marthinussen wrote:
I was pondering about how the concurrent_read and write settings balances towards max read/write threads in clients.
Lets say we have 3 nodes, and concurrent read/write set to 8.
That is, 8*3=24 threads for reading and writing.
Replication factor is 3.
Lets say we have clients that in total set up 16 connections to each node.
Now all the clients write at the same time. Since the replication factor is 3, you could get up to 16*3=48 concurrent write request per node (which needs to be handled by 8 threads)?
What is the result if this load continues?
Could you see that replication of data fails (at least initially) causing all kinds of fun timeouts around in the system?
Same on the read side.
If all clients read at the same time with Consistency level QUORUM, you get 16*2 read requests in best case (and more in worst case)?
Could you see that one node answers, but another one times out due to lack of read threads, causing read repair which again further degrades?
How does this queue up internally between thrift, gossip and the threads doing the actual read and writes?