From: William Oberman <oberman@civicscience.com>
Date: Tue, 30 Apr 2013 10:18:39 -0400
Subject: Re: normal thread counts?
To: user@cassandra.apache.org

I use phpcassa.

I did a thread dump. 99% of the threads look very similar (I'm using 1.1.9, in terms of matching source lines). The thread names are all like this: "WRITE-/10.x.y.z", and there are a LOT of duplicates (in terms of the same IP). Many, many of the threads are trying to talk to IPs that aren't in the cluster (I assume they're the IPs of dead hosts). The stack trace is basically the same for them all; it's attached at the bottom.
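In case it's useful to anyone chasing something similar, here is roughly how that per-name tally can be pulled over JMX. This is a minimal sketch of my own, not official Cassandra tooling, and localhost:7199 assumes the default JMX port:

--------
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadTally {
    public static void main(String[] args) throws Exception {
        // 7199 is Cassandra's default JMX port; adjust host/port for your node.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector jmx = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = jmx.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            // Count live threads by exact name; names like "WRITE-/10.x.y.z"
            // embed the peer IP, so duplicates per peer show up directly.
            Map<String, Integer> counts = new TreeMap<String, Integer>();
            for (long id : threads.getAllThreadIds()) {
                ThreadInfo info = threads.getThreadInfo(id);
                if (info == null) continue; // thread exited between calls
                String name = info.getThreadName();
                Integer n = counts.get(name);
                counts.put(name, n == null ? 1 : n + 1);
            }
            System.out.println("total threads: " + threads.getThreadCount());
            for (Map.Entry<String, Integer> e : counts.entrySet())
                System.out.println(e.getValue() + "\t" + e.getKey());
        } finally {
            jmx.close();
        }
    }
}
--------

The sorted output makes it obvious which peer IPs the WRITE- threads have piled up against.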
There are a lot of things I could talk about in terms of my situation, but here is what I think is pertinent to this thread: I hit a "tipping point" recently and upgraded a 9-node cluster from AWS m1.large to m1.xlarge (rolling, one at a time). 7 of the 9 upgraded fine and work great. 2 of the 9 keep struggling. I've replaced them many times now, each time using this process:
http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node

And even this morning, the only two nodes with a high number of threads are those two (yet again). At some point they'll OOM.

It seems like there is something about my cluster (caused by the recent upgrade?) that causes a thread leak in OutboundTcpConnection, but I don't know how to escape from the trap. Any ideas?

--------
stackTrace = [ {
    className = sun.misc.Unsafe;
    fileName = Unsafe.java;
    lineNumber = -2;
    methodName = park;
    nativeMethod = true;
  }, {
    className = java.util.concurrent.locks.LockSupport;
    fileName = LockSupport.java;
    lineNumber = 158;
    methodName = park;
    nativeMethod = false;
  }, {
    className = java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
    fileName = AbstractQueuedSynchronizer.java;
    lineNumber = 1987;
    methodName = await;
    nativeMethod = false;
  }, {
    className = java.util.concurrent.LinkedBlockingQueue;
    fileName = LinkedBlockingQueue.java;
    lineNumber = 399;
    methodName = take;
    nativeMethod = false;
  }, {
    className = org.apache.cassandra.net.OutboundTcpConnection;
    fileName = OutboundTcpConnection.java;
    lineNumber = 104;
    methodName = run;
    nativeMethod = false;
  } ];
----------
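For what it's worth, a back-of-the-envelope check using the numbers from my earlier mail (quoted below): ~27,000 threads at the 180k default -Xss works out to

--------
27,000 threads x 180 KB/stack  ~= 4.9 GB of thread stacks
15 GB physical - 12 GB heap    ~= 3 GB left for stacks + OS + everything else
--------

so at the larger heap sizes the stacks alone would more than exhaust what's left outside the heap. That fits the "unable to create new native thread" error, and suggests it's native memory, not the nproc limit (122944), that runs out.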
On Mon, Apr 29, 2013 at 4:31 PM, aaron morton <aaron@thelastpickle.com> wrote:

>> I used JMX to check current number of threads in a production cassandra
>> machine, and it was ~27,000.
>
> That does not sound too good.
>
> My first guess would be lots of client connections. What client are you
> using, and does it do connection pooling?
> See the comments in cassandra.yaml around rpc_server_type: the default,
> sync, uses one thread per connection; you may be better off with HSHA. But
> if your app is leaking connections, you should probably deal with that
> first.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 30/04/2013, at 3:07 AM, William Oberman <oberman@civicscience.com> wrote:
>
>> Hi,
>>
>> I'm having some issues. I keep getting:
>> ------------
>> ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java
>> (line 135) Exception in thread Thread[GossipStage:1,5,main]
>> java.lang.OutOfMemoryError: unable to create new native thread
>> --------------
>> after a day or two of runtime. I've checked, and my system settings seem
>> acceptable:
>> memlock=unlimited
>> nofiles=100000
>> nproc=122944
>>
>> I've messed with heap sizes from 6-12GB (15GB physical, m1.xlarge in AWS),
>> and I keep OOM'ing with the above error.
>>
>> I've found some (what seem to me) obscure references to the stack size
>> interacting with the number of threads. If I'm understanding it correctly,
>> to reason about Java memory usage I have to think of OS + heap as locked
>> down; the stack space gets the "leftovers" of physical memory, and each
>> thread gets a stack.
>>
>> For me, the system ulimit setting on stack size is 10240k (no idea if
>> Java sees or respects this setting). My -Xss for cassandra is the default
>> (I hope; I don't remember messing with it) of 180k. I used JMX to check
>> the current number of threads on a production cassandra machine, and it
>> was ~27,000. Is that a normal thread count? Could my OOM be related to
>> stack size + number of threads, or am I overlooking something simpler?
>>
>> will