Final cause of the problem:

One node's config had its RPC server type (rpc_server_type) changed from sync to hsha.

Apparently that mismatch can break RPC across the cluster.

It would be nice if there were a good way to set that in a single spot for the whole cluster, or if the mismatch were handled differently.  As it stands, changing a cluster from sync to hsha means restarting every node (not a big deal in itself), but CQL would apparently not work at all until all of the nodes had been restarted.
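
For anyone who wants to check for this kind of mismatch before it bites, here is a minimal sketch in Python. It assumes you have already collected the text of each node's cassandra.yaml somehow (scp, config management, etc.); the function names and node names are made up for illustration, and it only does a simple line match on rpc_server_type rather than full YAML parsing.

```python
import re

# Matches an uncommented "rpc_server_type: <value>" line in cassandra.yaml text.
_RPC_TYPE = re.compile(r"^[ \t]*rpc_server_type:[ \t]*(\S+)", re.MULTILINE)


def rpc_server_type(yaml_text):
    """Return the configured rpc_server_type, or None if the setting is absent."""
    match = _RPC_TYPE.search(yaml_text)
    return match.group(1) if match else None


def find_mismatches(configs):
    """Group node names by rpc_server_type.

    configs maps a (hypothetical) node name to the text of its cassandra.yaml.
    Returns {} when every node agrees, otherwise {value: [node, ...]}.
    """
    by_type = {}
    for node, text in configs.items():
        by_type.setdefault(rpc_server_type(text), []).append(node)
    return by_type if len(by_type) > 1 else {}


if __name__ == "__main__":
    # Hypothetical node names and trimmed-down config text.
    configs = {
        "node1": "cluster_name: test\nrpc_server_type: sync\n",
        "node2": "cluster_name: test\nrpc_server_type: hsha\n",
    }
    for value, nodes in find_mismatches(configs).items():
        print(value, nodes)
```

A node whose config omits the setting shows up under None, so it is still flagged as differing from nodes that set it explicitly.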


On Fri, Mar 29, 2013 at 10:35 AM, David McNelis <dmcnelis@gmail.com> wrote:
It appears that restarting a node makes CQL available on that node again, but only on that node.

Looks like I'll be doing a rolling restart.


On Fri, Mar 29, 2013 at 10:26 AM, David McNelis <dmcnelis@gmail.com> wrote:
I'm running 1.2.3 and have both CQL3 tables and old-school CFs in my cluster.

I'd had a large insert job running for the last several days, which just ended.  It had been inserting with CQL3 insert statements into a CQL3 table.

Now, I show no compactions going on in my cluster, but for some reason every CQL3 query I try to execute (insert or select, through cqlsh or through an external library) times out with an rpc_timeout.

If I use cassandra-cli, I can do "list tablename limit 10" and immediately get my 10 rows back.

However, if I do "select * from tablename limit 10" I get the rpc timeout error.  Same table, same server.  It doesn't seem to matter whether I'm hitting a CQL3-defined table or an older-style one.

Load on the nodes is relatively low at the moment. 

Any suggestions short of restarting nodes?  This is a pretty major issue for us right now.