incubator-cassandra-user mailing list archives

From Daniel Doubleday <>
Subject Re: Dynamic Snitch / Read Path Questions
Date Mon, 13 Dec 2010 13:00:53 GMT
Hi Peter

I should have started with the why instead of what ...

Background Info (I'll try to be brief ...)

We have a very small production cluster (started with 3 nodes, now we have 5). Most of our
data currently lives in MySQL, but we want to slowly move the larger tables, which are killing
our MySQL cache, to Cassandra. After migrating another table we hit under-capacity last
week. Since we couldn't cope with it and the old system was still running in the background,
we switched back and added another 2 nodes while still performing writes to the cluster (but
no reads).

We are entirely IO-bound. What killed us last week was too many reads combined with flushes
and compactions. Reducing compaction priority helped, but it was not enough. The main reason
we could not add nodes, though, had to do with the quorum reads we are doing:

First we stopped compaction on all nodes. Everything was golden; the cluster handled the load
easily. Then we bootstrapped a new node. That increased the IO pressure on the node which
was selected as streaming source, because it started anti-compaction. The increased pressure
slowed the node down. Just a little, but enough to get it flooded by digest requests from the
other nodes. We have seen this before:

So the status at that point was: 2 nodes that were serving requests at 10 - 50ms, and one that,
errr... wouldn't serve requests (average response time at the storage proxy was 10 secs). The
problem here was that users would suffer from the slow server, because it was not down and was
still being queried by clients. Also, because the streaming node was so overwhelmed, anti-compaction
became *really* slow. It would have taken days to finish.

Then we took one node down to prevent digest flooding, and things got much better. It almost
worked out, but at peak hours it collapsed. At this point we rolled back.

From this my learning / guessing is:

We could have survived this if the streaming node had not had to serve

- read requests (because then users would not have been affected) and / or
- digest requests (because that would have reduced IO pressure)

To summarize: I want to prevent the slowest node from

A) affecting latency at the level of the StorageProxy
B) getting digest requests, because the only thing they are good for is killing it

Ok sorry - that was not brief ...

Back to my original mail. 1-3) were supposed to describe the current situation in order to
validate that my understanding is actually correct.

Point A) would be achieved with quorum if the slowest node were never selected to perform
the actual data read. That way the ReadResponseResolver would be satisfied without waiting for
the slowest node. I think that's what's supposed to happen when using the dynamic snitch without
any code change, and that was the point of question 1) in my last mail. Question 2) is important
to me because in the future we might want to read at CL ONE for other use cases, and Cassandra
seems to short-circuit the messaging service in the case where the proxy node contains the row.
Thus in that case the node would respond even if it is the slowest node, so it's not load balancing.
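To make question 1) concrete, here is a minimal sketch of the mechanism as I understand it — the `String` replicas and the score map are illustrative assumptions on my part, not Cassandra's actual classes. Sorting the live replicas by latency score means the data read goes to the best node and the slowest is never picked:

```java
import java.util.*;

// Hypothetical sketch of score-based replica ordering; not Cassandra's code.
public class SnitchSketch {

    // Sort replicas ascending by latency score; replicas with no score sort last.
    static List<String> sortByScore(List<String> replicas, Map<String, Double> scores) {
        List<String> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingDouble(
                (String r) -> scores.getOrDefault(r, Double.MAX_VALUE)));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("10.0.0.1", 12.0);  // the proxy node itself
        scores.put("10.0.0.2", 4.0);   // fastest replica
        scores.put("10.0.0.3", 900.0); // overloaded streaming node
        List<String> order = sortByScore(
                Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"), scores);
        // The data read goes to order.get(0); the rest get digest requests.
        System.out.println(order.get(0)); // prints 10.0.0.2
    }
}
```

With a snitch that ignores scores, the proxy node would sort itself first even when it is the slowest — which is exactly the CL ONE concern in question 2).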

With question 3) I wanted to verify that my understanding of digest requests is correct. Before
digging deeper I thought that Cassandra would have some magical way of calculating MD5 digests
for conflict resolution without reading the column values. But (I think) it obviously cannot
do that, because conflict resolution is not based on timestamps alone: it falls back to
selecting the bigger value if timestamps are equal. This point is important to me because it
means that digest requests are just as expensive as regular read requests, and that I might be
able to reduce IO pressure significantly if I change the way quorum reads are performed.
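A toy illustration of that point — my own sketch, not Cassandra's digest code: since the hash has to cover column names, values and timestamps, the node must read the full columns from disk before it can digest them; only the response payload is smaller.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Illustrative only: an MD5 digest over full column contents.
public class DigestSketch {

    // Each column is {name, value, timestamp}; all three feed the hash,
    // so the whole column must be read off disk before digesting.
    static String digest(String[][] columns) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String[] col : columns)
                for (String part : col)
                    md5.update(part.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest())
                hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String[][] a = { { "col", "v1", "100" } };
        String[][] b = { { "col", "v2", "100" } };
        // Identical rows produce matching digests...
        System.out.println(digest(a).equals(digest(a))); // prints true
        // ...and any value difference is a mismatch, which would trigger
        // full data reads on all replicas.
        System.out.println(digest(a).equals(digest(b))); // prints false
    }
}
```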

And that's what question 4) was about:

Your question was:

> | Am I interpreting you correctly that you want to switch the default
> | read mode in the case of QUOROM to optimistically assume that data is
> | consistent and read from one node and only perform digest on the rest?

Well, almost :-) In fact I believe you are describing what happens now in vanilla
Cassandra. Quorum reads are performed by reading the data from the node which is selected
by the endpoint snitch (that is: self, same rack, same DC with the standard one, or score-based
with the dynamic snitch). All other live nodes receive digest requests.

Only when a digest mismatch occurs are full read requests sent to all nodes.
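In sketch form — my simplified reading of the read path, with made-up names, not the actual StorageProxy code — the current vanilla behavior looks like this:

```java
import java.util.*;

// Simplified model of a vanilla quorum read: the snitch-preferred replica
// gets the data request, every other live replica gets a digest request.
public class QuorumReadSketch {

    static Map<String, String> planRead(List<String> liveSortedByProximity) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (int i = 0; i < liveSortedByProximity.size(); i++)
            plan.put(liveSortedByProximity.get(i), i == 0 ? "DATA" : "DIGEST");
        return plan;
    }

    public static void main(String[] args) {
        // With RF=3, all three live replicas are contacted on the first round.
        System.out.println(planRead(Arrays.asList("n1", "n2", "n3")));
        // prints {n1=DATA, n2=DIGEST, n3=DIGEST}
    }
}
```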

BTW: it seems (but that's probably just a misinterpretation of the code on my part) that
disabling read repair is bad when doing quorum reads. For instance, if the two digest requests
return before the data request and one of them reports a mismatch, nothing would be returned,
even though the overall response would have been valid and the correct data was present. The
same seems to be true for every mismatch. I guess that tying this to the read repair config
might be wrong, or at least unintuitive.

What I want to do is change the behavior so that on the first round only the two 'best'
nodes are used: one for data, one for digest. To do that you'd only have to sort the live
nodes by 'proximity' and use the first two. Only if that fails would I fall back to the default
behavior and do a full data read on all nodes. So in a perfect world of 3 consistent nodes
I would reduce IO reads by 1/3.
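As a sketch — my assumption of what such a patch would do, not actual code: contact only quorum-many of the proximity-sorted replicas on the optimistic first round, and fall back to data reads on all replicas only on failure or digest mismatch.

```java
import java.util.*;

// Proposed optimistic quorum read: only the 'quorum' best replicas are
// contacted at first (one data + digests); the remaining replica is
// left alone, saving one disk read per request in the happy path.
public class OptimisticQuorumSketch {

    static Map<String, String> planRead(List<String> liveSortedByProximity, int quorum) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (int i = 0; i < quorum && i < liveSortedByProximity.size(); i++)
            plan.put(liveSortedByProximity.get(i), i == 0 ? "DATA" : "DIGEST");
        return plan;
    }

    public static void main(String[] args) {
        // RF=3, quorum=2: node n3 gets no request at all on the first round.
        System.out.println(planRead(Arrays.asList("n1", "n2", "n3"), 2));
        // prints {n1=DATA, n2=DIGEST}
    }
}
```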

Best, and thanks for your time (if you bothered to read all of this),

On Dec 13, 2010, at 9:19 AM, Peter Schuller wrote:

>> 1) If using CL > 1 then using the dynamic snitch will result in a data read
>> from the node with the lowest latency (slightly simplified) even if the proxy
>> node contains the data but has a higher latency than other possible nodes,
>> which means that it is not necessary to do load-based balancing on the
>> client side.
>> 2) If using CL = 1 then the proxy node will always return the data itself
>> even when there is another node with less load.
>> 3) Digest requests will be sent to all other living peer nodes for that key
>> and will result in a data read on all nodes to calculate the digest. The
>> only difference is that the data is not sent back but IO-wise it is just as
>> expensive.
> I think I may just be completely misunderstanding something, but I'm
> not really sure to what extent you're trying to describe the current
> situation and to what extent you're suggesting changes? I'm not sure
> about (1) and (2) though my knee-jerk reaction is that I would expect
> it to be mostly agnostic w.r.t. which node happens to be taking the
> RPC call (e.g. the "latency" may be due to disk I/O and preferring the
> local node has lots of potential to be detrimental, while forwarding
> is only slightly more expensive).
> (3) sounds like what's happening with read repair.
>> The next one goes a little further:
>> We read / write with quorum / rf = 3.
>> It seems to me that it wouldn't be hard to patch the StorageProxy to send
>> only one read request and one digest request. Only if one of the requests
>> fails would we have to query the remaining node. We don't need read repair
>> because we have to repair once a week anyways and quorum guarantees
>> consistency. This way we could reduce read load significantly which should
>> compensate for latency increase by failing reads. Am I missing something?
> Am I interpreting you correctly that you want to switch the default
> read mode in the case of QUORUM to optimistically assume that data is
> consistent and read from one node and only perform digest on the rest?
> What's the goal here? The only thing saved by digest reads at QUORUM
> seems to me to be the throughput saved by not sending the data. You're
> still taking the reads in terms of potential disk I/O, and you still
> have to wait for the response, and you're still taking almost all of
> the CPU hit (still reading and checksumming, just not sending back).
> For highly contended data the need to fall back to real reads would
> significantly increase average latency.
> -- 
> / Peter Schuller
