cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Random slow connects.
Date Mon, 18 Jun 2012 02:00:23 GMT
You could also try adding some logging in the client to track down exactly where the delay
is, for example whether it is in waiting for the socket to open on the server or in managing
the connection on the client side.
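
One way to add that kind of client-side logging is sketched below. It is not taken from any particular Cassandra client: `timed_phase`, `open_connection`, and the 0.5 second threshold are hypothetical names and values chosen for illustration. Each step is wrapped in a timer, and a warning is logged when a step is slow, so a slow socket connect can be told apart from time spent elsewhere in the client.

```python
import logging
import socket
import time
from contextlib import contextmanager

log = logging.getLogger("conn_timing")

# Assumed threshold for "slow"; tune it to your pool's connect timeout.
SLOW_THRESHOLD_SECONDS = 0.5


@contextmanager
def timed_phase(name, timings):
    """Record how long the wrapped block took and log it."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        timings[name] = elapsed
        level = logging.WARNING if elapsed > SLOW_THRESHOLD_SECONDS else logging.DEBUG
        log.log(level, "%s took %.3fs", name, elapsed)


def open_connection(host, port, timeout=2.0):
    """Open a TCP connection and return (socket, timings dict)."""
    timings = {}
    with timed_phase("tcp_connect", timings):
        sock = socket.create_connection((host, port), timeout=timeout)
    return sock, timings
```

The same `timed_phase` wrapper could be placed around pool checkout or the first request on a new connection, so that each phase shows up separately in the log.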

Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 15/06/2012, at 4:51 AM, Tyler Hobbs wrote:

> As a random guess, you might want to check your open file descriptor limit on the C*
servers.  Use "cat /proc/<pid>/limits", where <pid> is the pid of the Cassandra
process; it's the most reliable way to check this.
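
If you want to check this from a script rather than by eye, a minimal sketch follows. It assumes a Linux host, where `/proc/<pid>/limits` has a "Max open files" row and `/proc/<pid>/fd` lists the open descriptors; the function names are made up for this example, and reading another user's process usually needs matching permissions.

```python
import os


def parse_open_files_limit(limits_text):
    """Return (soft, hard) from the 'Max open files' row as strings.

    Values stay strings because a limit can also read 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            fields = line.split()
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' row found")


def read_open_files_limit(pid):
    """Read the limits file for a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_open_files_limit(f.read())


def count_open_fds(pid):
    """Count the descriptors the process currently holds (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Comparing `count_open_fds(pid)` against the soft limit over time shows whether the process gets close to the ceiling around the moments when connects are slow.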
> 
> On Thu, Jun 14, 2012 at 10:43 AM, Henrik Schröder <skrolle@gmail.com> wrote:
> Hi Mina,
> 
> The delay is not constant. In the vast majority of cases, connecting is almost instant,
but occasionally connecting to a server takes a few seconds.
> 
> We can't even reproduce it reliably. We can see in our server logs that sometimes, maybe
a few times a day, maybe once every few days, a Cassandra server will be slow in accepting
connections, and after a little while everything will be ok again. It's not network saturation,
it's not CPU saturation, and it's not even GC pauses.
> 
> Has anyone else noticed something similar? Or is this simply a result of us running a
tight connection pool which recycles connections every few hours and only waits a few seconds
for a connection before timing out?
> 
> 
> /Henrik
> 
> 
> On Thu, Jun 14, 2012 at 4:54 PM, Mina Naguib <mina.naguib@bloomdigital.com> wrote:
> 
> On 2012-06-14, at 10:38 AM, Henrik Schröder wrote:
> 
> > Hi everyone,
> >
> > We have a problem with our Cassandra cluster: sometimes it takes
several seconds to open a new Thrift connection to the server. We had this issue when we
ran on Windows, and we have it now that we run on Ubuntu. We had it with our old
networking setup, and we have it with our new networking setup, where we're running over
a dedicated gigabit network. Normally establishing a new connection is instant, but once in
a while it seems like it's not accepting any new connections until three seconds have passed.
> >
> > We're of course running a connection-pooling client which mitigates this, since
once a connection is established, it's rock solid.
> >
> > We tried switching the rpc_server_type to hsha, but that seems to have made the
problem worse: we're seeing more connection timeouts since the change.
> >
> > For what it's worth, we're running Cassandra version 1.0.10 on Ubuntu, and our connection
pool is configured to abort a connection attempt after two seconds, and each connection lives
for six hours and then it's recycled. Under current load we do about 500 writes/s and 100
reads/s, we have 20 clients, but each has a very small connection pool of maybe up to 5 simultaneous
connections against each Cassandra server. We see these connection issues maybe once a day,
but always at random intervals.
> >
> > We've tried to get more information through DataStax OpsCenter, the JMX console,
and our own application monitoring and logging, but we can't see anything out of the ordinary.
Sometimes, seemingly at random, it's just really slow to connect. We're all out of ideas.
Does anyone here have suggestions on where to look and what to do next?
> 
> Have you ruled out potential non-Cassandra causes?
> 
> A consistent 3 seconds sounds like it could be a timeout/retry somewhere.  Do you contact Cassandra
via a hostname or IP address?  If via hostname, rule out DNS.
> 
> Either way, I'd fire up tcpdump on both the client and the server, and observe
the TCP handshake.  Specifically, see whether the SYN packet is sent and received, whether the SYN-ACK
is sent back right away and received, and whether the final ACK follows.
> 
> If that looks good, then TCP-wise you're in good shape and the problem is in a higher
layer (thrift).  If not, see where the delay/drop/retry happens.  If it's in the first packet,
it may be a networking/routing issue.  If in the second, it may be capacity at the server
(investigate with lsof/netstat/JMX), etc.
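
As a complement to tcpdump, the hostname question can be tested from the client with a small probe that times name resolution and the TCP connect separately. This is only a sketch: `probe_connect` is a hypothetical helper, and it measures what the client sees, not where on the wire a packet was lost. A connect time clustered near 3 seconds would be consistent with a dropped SYN being retransmitted, since 3 seconds was the traditional initial retransmission timeout (RFC 1122) before RFC 6298 lowered it to 1 second; that is a general TCP fact, not something confirmed for this cluster.

```python
import socket
import time


def probe_connect(host, port, timeout=5.0):
    """Time DNS resolution and TCP connect separately for one attempt."""
    result = {"address": None, "dns_seconds": None,
              "connect_seconds": None, "error": None}

    # Phase 1: name resolution (near zero when host is already an IP).
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except OSError as exc:
        result["dns_seconds"] = time.monotonic() - start
        result["error"] = repr(exc)
        return result
    result["dns_seconds"] = time.monotonic() - start

    # Phase 2: TCP handshake against the first resolved address.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    result["address"] = sockaddr[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    start = time.monotonic()
    try:
        sock.connect(sockaddr)
    except OSError as exc:
        result["error"] = repr(exc)
    finally:
        result["connect_seconds"] = time.monotonic() - start
        sock.close()
    return result
```

Running this in a loop against the Thrift port (9160 is the default `rpc_port`; adjust if yours differs) and logging only the slow results would show whether the occasional delay lands in `dns_seconds` or in `connect_seconds`.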
> 
> 
> 
> 
> 
> 
> -- 
> Tyler Hobbs
> DataStax
> 

