Thanks for the feedback, guys. That example data model was indeed abbreviated - the real queries do have the partition key in them. I am using RF 3 on the keyspace, so I don't think a single node being down would make the key I'm looking for unavailable. The driver's load balancing policy seems correct (https://docs.datastax.com/en/developer/nodejs-driver/3.4/features/tuning-policies/#load-balancing-policy): I am using the default `TokenAware` policy with `DCAwareRoundRobinPolicy` as a child, but I will look more closely at the implementation.
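In case it is useful, this is roughly how the client is wired up; the contact points, keyspace, and 'dc1' data center name below are placeholders, not my actual config:

```js
// Minimal sketch of the client setup with the Node.js driver (3.x).
// Contact points, keyspace and the 'dc1' DC name are placeholders.
const cassandra = require('cassandra-driver');
const { TokenAwarePolicy, DCAwareRoundRobinPolicy } = cassandra.policies.loadBalancing;

const client = new cassandra.Client({
  contactPoints: ['10.0.0.1', '10.0.0.2', '10.0.0.3'], // placeholder addresses
  keyspace: 'my_keyspace',                             // placeholder keyspace
  policies: {
    // Token-aware routing wrapped around DC-aware round robin; this mirrors
    // the driver's default, the local DC is just spelled out explicitly.
    loadBalancing: new TokenAwarePolicy(new DCAwareRoundRobinPolicy('dc1'))
  }
});
```

Passing the local DC name explicitly (rather than letting the driver infer it from the first contact point) is mostly to rule out one more variable.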
It was an oversight of mine to not include `nodetool disablebinary`, but I still experience the same issue with that.
One other thing I've noticed: after restarting a node, while the application latency is elevated, the node I just restarted reports many other nodes in the same DC as down (i.e. status 'DN'). However, running `nodetool status` on those other nodes shows all nodes as up/normal. To me this could roughly explain the problem - the node comes back online, thinks it is healthy while many others are not, so it starts receiving traffic from the client application; but it then gets requests for ranges that belong to nodes it thinks are down, so it responds with errors. The latency issue seems to start roughly when the node goes down, but it persists long (15-20 minutes) after the node is back online and accepting connections, and it seems to go away once the bounced node shows the other nodes in the same DC as up again.
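For what it's worth, a rough way to correlate the client's view of host state with what gossip shows on the bounced node (assuming the 3.x `Client` emits `hostUp`/`hostDown` events as its API docs describe, and reusing the `client` instance from the sketch above) would be something like:

```js
// Log the driver's own view of host state changes during the bounce, so it
// can be lined up against `nodetool status` output on each node.
client.on('hostDown', (host) => {
  console.log(`${new Date().toISOString()} driver marked DOWN: ${host.address}`);
});
client.on('hostUp', (host) => {
  console.log(`${new Date().toISOString()} driver marked UP:   ${host.address}`);
});
```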
As for speculative retry, my CF is using the default of '99th percentile'. I could try something different there, but the nodes being seen as down after the bounce seems like the more fundamental issue.