On Sun, Aug 8, 2010 at 5:24 AM, aaron morton <email@example.com> wrote:
Not sure how feasible it is or if it's planned. But it would probably require that the nodes are able to share the state of their row cache so as to know which parts to warm. Otherwise it sounds like you're assuming the node can hold the entire data set in memory.
I'm not assuming the node can hold the entire Cassandra data set in memory, if that's what you meant. I was thinking of sharing the state of the row cache, but only for those keys that are being moved for the token. The other keys can stay hidden from the new node.
If you know in your application when you would like data to be in the cache, you can send a query like get_range_slices to the cluster and ask for 0 columns. That will warm the row cache for the keys it hits.
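A minimal sketch of that warm-up loop, under stated assumptions: `fetch_keys` is a hypothetical wrapper, not a real client call, that you would implement around get_range_slices with a column count of 0 so that it returns only the row keys of one page. The sketch also assumes the range start is inclusive, so each later page repeats the last key of the previous one. Whether touching a row this way actually populates the row cache is taken from the suggestion above, not verified here.

```python
from typing import Callable, List

# Hypothetical wrapper type: fetch_keys(start_key, page_size) returns up to
# page_size row keys, beginning at start_key (inclusive). In a real setup it
# would issue get_range_slices asking for 0 columns; it is an assumption here.
FetchKeys = Callable[[str, int], List[str]]


def warm_row_cache(fetch_keys: FetchKeys, start_key: str = "",
                   page_size: int = 1000) -> int:
    """Page through the key range and return the number of distinct rows touched."""
    if page_size < 2:
        # With an inclusive start, a page of 1 would only ever repeat the last key.
        raise ValueError("page_size must be at least 2")

    touched = 0
    last_key = None
    next_start = start_key
    while True:
        page = fetch_keys(next_start, page_size)
        # Drop the first key if it is the one the previous page ended on.
        if last_key is not None and page and page[0] == last_key:
            new_keys = page[1:]
        else:
            new_keys = page
        if not new_keys:
            break
        touched += len(new_keys)
        last_key = new_keys[-1]
        next_start = last_key
        if len(page) < page_size:
            # A short page means the end of the range was reached.
            break
    return touched
```

Only the keys come back to the client, so the pass is cheap on the network; the cost is the read work on the nodes, which is why it competes with live traffic if the node is already serving requests.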
This is a tough one, as our row cache holds over 20 million rows and takes a while to reach a high hit ratio, so while we are trying to preload it the node is already taking requests. If it were possible to bring up a node that doesn't announce its availability to the cluster, that would help us warm the cache manually. I know this feature is in the issue tracker currently, but it didn't look like it would come out anytime before 0.8.
I have heard it mentioned that the coordinator node will take action when one node is considered to be running slow. So it may be able to work around the new node until it gets warmed up.
That is interesting; I hadn't heard that one. With the parallel reads that are happening, it makes sense that it would be possible. That is, unless the data is local. I believe in that case it always prefers reading locally to reading over the network, so if the local machine is the slow node, that wouldn't help.
Are you adding nodes often?
Currently not that often. The main issue is that we have very stringent latency requirements, and for anything that would affect them we have to understand the worst-case cost to see if we can avoid it.
On 7 Aug 2010, at 11:17, Artie Copeland wrote:
The way I understand row caches to work is that each node has an independent cache, in that they do not share their cache contents with other nodes. If that is the case, is it also true that when a new node is added to the cluster it has to build up its own cache? If so, I see that as a possible performance bottleneck once the node starts to accept requests, since there is no way I know of to warm the cache without adding the node to the cluster. Would it be infeasible to have the bootstrap process stream not only data from other nodes but also the cached rows associated with those same keys? That would allow the new node to provide the best performance once the bootstrap process finishes.