cassandra-user mailing list archives

From Max Campos <>
Subject Re: Lots of simultaneous connections?
Date Thu, 14 Dec 2017 19:38:10 GMT
Hi Kurt, thanks for your reply — really appreciate the help that you (and everyone else!)
continually give people in the C* user community.

All of these clients & servers are on the same (internal) network, so there is no firewall
between the clients & servers.  

Our C* application is a QA test results system.  We have thousands of machines in-house which
we use to test the software (not C* related) which we sell, and we’re using C* to capture
the results of those tests.

So the flow is:
On each machine (~2500):
… we run tests (~5-20 per machine)
… each test has ~8 steps
… each step makes a connection to the DB, logs the start time to C*, invokes the test
script which runs the step (2 mins to 20 hours — no C* usage during this part), and then
captures the result to C* (end time, exit status, etc.). A rough sketch of this pattern follows.
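In code, each step currently looks roughly like this (DataStax Python driver; the
qa_results keyspace and step_log table are made-up names for illustration, not our real schema):

    from cassandra.cluster import Cluster

    def run_step(step_name, run_test_script):
        # One connection per step, held open for the whole run.
        cluster = Cluster(['x.y.z.204', 'x.y.z.205', 'x.y.z.206'])
        session = cluster.connect('qa_results')  # illustrative keyspace
        session.execute(
            "INSERT INTO step_log (step, started_at) "
            "VALUES (%s, toTimestamp(now()))", (step_name,))
        status = run_test_script()  # 2 mins to 20 hours; connection sits idle
        session.execute(
            "UPDATE step_log SET ended_at = toTimestamp(now()), exit_status = %s "
            "WHERE step = %s", (status, step_name))
        cluster.shutdown()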

Today we’re not disconnecting from C* during the “run the step” part, and we’re getting
OperationTimedOut errors as we scale up the number of tests using our C* application.
My theory is that we’re overwhelming C* with the sheer number of (mostly idle) connections
to our 3-node cluster. The fix I have in mind is sketched just below.
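The idea is the explicit disconnect I mention in my quoted message below: connect, write,
and tear down around each burst of writes, so nothing idles while the test script runs.
Same made-up schema as in the sketch above:

    from cassandra.cluster import Cluster

    HOSTS = ['x.y.z.204', 'x.y.z.205', 'x.y.z.206']

    def log_to_cassandra(query, params):
        # Connect, write, and shut down immediately so no idle socket
        # is held while the test script runs.
        cluster = Cluster(HOSTS)
        try:
            cluster.connect('qa_results').execute(query, params)
        finally:
            cluster.shutdown()

    def run_step(step_name, run_test_script):
        log_to_cassandra(
            "INSERT INTO step_log (step, started_at) "
            "VALUES (%s, toTimestamp(now()))", (step_name,))
        status = run_test_script()  # no C* connection open during this part
        log_to_cassandra(
            "UPDATE step_log SET ended_at = toTimestamp(now()), exit_status = %s "
            "WHERE step = %s", (status, step_name))

A Cluster per burst costs a connection setup each time, but with only a few writes per
multi-minute step that overhead looks negligible.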

I’m hoping someone has seen this sort of problem and can say “Yeah, that’s too many
connections — I’m sure that’s your problem.”  or “We regularly make 12M connections
per C* node — you’re screwed up in some other way — have you checked file descriptor
limits?  What’s your Java __whatever__ setting?”  etc.
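On the file descriptor angle: a quick Linux-side check is comparing the descriptors
Cassandra actually holds against its limit (this assumes /proc is readable, which may
need root, and that the PID is looked up by hand, e.g. with ps):

    import os

    CASSANDRA_PID = 12345  # placeholder: substitute the real PID

    fd_dir = '/proc/%d/fd' % CASSANDRA_PID
    open_fds = len(os.listdir(fd_dir))  # sockets, files, etc. currently open
    with open('/proc/%d/limits' % CASSANDRA_PID) as f:
        max_open = [line.strip() for line in f
                    if line.startswith('Max open files')]
    print('open fds: %d' % open_fds)
    print(max_open[0])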

Thanks, Kurt.  :-)

- Max

> On Dec 14, 2017, at 6:19 am, kurt greaves <> wrote:
> I see timeouts and I immediately blame firewalls. Have you triple checked them?
> Is this only occurring to a subset of clients?
> Also, 3.0.6 is pretty dated and has many bugs; you should definitely upgrade to the latest
> 3.0 (don't forget to read NEWS.txt).
> On 14 Dec. 2017 19:18, "Max Campos" <> wrote:
> Hi -
> We’re finally putting our new application under load, and we’re starting to get this
> error message from the Python driver when under heavy load:
> ('Unable to connect to any servers', {'x.y.z.205': OperationTimedOut('errors=None,
> last_host=None',), 'x.y.z.204': OperationTimedOut('errors=None, last_host=None',),
> 'x.y.z.206': OperationTimedOut('errors=None, last_host=None',)})' (22.7s)
> Our cluster is running 3.0.6, has 3 nodes and we use RF=3, CL=QUORUM reads/writes.  We
> have a few thousand machines which are each making 1-10 connections to C* at once, but each
> of these connections only reads/writes a few records, waits several minutes, and then writes
> a few records — so while netstat reports ~5K connections per node, they’re generally idle.
> Peak read/sec today was ~1500 per node, peak writes/sec was ~300 per node.  Read/write
> latencies peaked at 2.5ms.
> Some questions:
> 1) Is anyone else out there making this many simultaneous connections?  Any idea what
> a reasonable number of connections is, what is too many, etc?
> 2) Any thoughts on which JMX metrics I should look at to better understand what exactly
> is exploding?  Is there a “number of active connections” metric?  (One candidate is
> sketched after this list.)  We currently look at:
> - client reads/writes per sec
> - read/write latency
> - compaction tasks
> - repair tasks
> - disk used by node
> - disk used by table
> - avg partition size per table
> 3) Any other advice?
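One candidate for question 2: Cassandra exposes a per-node gauge of connected native-protocol
clients, org.apache.cassandra.metrics:type=Client,name=connectedNativeClients. A sketch of
polling it from Python, assuming a Jolokia agent is attached to each node on its default
port 8778 (an add-on, not part of a stock install):

    import json
    from urllib.request import urlopen

    MBEAN = ('org.apache.cassandra.metrics:'
             'type=Client,name=connectedNativeClients')

    for host in ('x.y.z.204', 'x.y.z.205', 'x.y.z.206'):
        url = 'http://%s:8778/jolokia/read/%s' % (host, MBEAN)
        # Jolokia wraps the MBean attributes; the gauge's attribute is Value.
        reply = json.loads(urlopen(url).read())
        print('%s: %s native clients' % (host, reply['value']['Value']))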
> I think I’ll try doing an explicit disconnect during the waiting period of our application’s
> execution, so as to get the C* connection count down.  Hopefully that will solve the timeout
> problem.
> Thanks for your help.
> - Max
