From: William Oberman <oberman@civicscience.com>
Date: Wed, 1 May 2013 17:22:01 -0400
Subject: Re: normal thread counts?
To: user@cassandra.apache.org

That has GOT to be it.  1.1.10 upgrade it is...


On Wed, May 1, 2013 at 5:09 PM, Janne Jalkanen <Janne.Jalkanen@ecyrd.com> wrote:

> This sounds very much like
> https://issues.apache.org/jira/browse/CASSANDRA-5175, which was fixed in
> 1.1.10.
>
> /Janne
>
> On Apr 30, 2013, at 23:34, aaron morton <aaron@thelastpickle.com> wrote:
>
> Many many many of the threads are trying to talk to IPs that aren't in
> the cluster (I assume they are the IPs of dead hosts).
>
> Are these IPs from before the upgrade? Are they IPs you expect to see?
>
> Cross reference them with the output from nodetool gossipinfo to see why
> the node thinks they should be used.
> Could you provide a list of the thread names?
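A rough way to do that cross-reference from the shell (a sketch only; it
assumes jstack can attach to the Cassandra process, and it uses the
"WRITE-/10.x.y.z" thread-name pattern described further down the thread):

# count outbound-connection threads per peer IP in a thread dump
jstack $(pgrep -f CassandraDaemon) | grep -o '"WRITE-/[0-9.]*"' | sort | uniq -c | sort -rn

# list the endpoints this node currently knows about, to compare against
nodetool -h localhost gossipinfo | grep '^/'
nodetool -h localhost ring

IPs that show up in the thread names but not in the ring are the ones worth
looking for in the gossipinfo output.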
> One way to remove those IPs may be to do a rolling restart with
> -Dcassandra.load_ring_state=false in the JVM opts at the bottom of
> cassandra-env.sh
>
> The OutboundTcpConnection threads are created in pairs by the
> OutboundTcpConnectionPool, which is created here
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
> The threads are created in the OutboundTcpConnectionPool constructor; it's
> worth checking to see if this could be the source of the leak.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 1/05/2013, at 2:18 AM, William Oberman <oberman@civicscience.com> wrote:
>
> I use phpcassa.
>
> I did a thread dump.  99% of the threads look very similar (I'm using
> 1.1.9 in terms of matching source lines).  The thread names are all like
> this: "WRITE-/10.x.y.z".  There are a LOT of duplicates (in terms of the
> same IP).  Many many many of the threads are trying to talk to IPs that
> aren't in the cluster (I assume they are the IPs of dead hosts).  The
> stack trace is basically the same for them all, attached at the bottom.
>
> There are a lot of things I could talk about in terms of my situation, but
> what I think might be pertinent to this thread: I hit a "tipping point"
> recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge
> (rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of
> the 9 keep struggling.  I've replaced them many times now, each time using
> this process:
> http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
> And even this morning the only two nodes with a high number of threads are
> those two (yet again).  And at some point they'll OOM.
>
> Seems like there is something about my cluster (caused by the recent
> upgrade?) that causes a thread leak on OutboundTcpConnection.  But I don't
> know how to escape from the trap.  Any ideas?
>
>
> --------
>   stackTrace = [ {
>     className = sun.misc.Unsafe;
>     fileName = Unsafe.java;
>     lineNumber = -2;
>     methodName = park;
>     nativeMethod = true;
>    }, {
>     className = java.util.concurrent.locks.LockSupport;
>     fileName = LockSupport.java;
>     lineNumber = 158;
>     methodName = park;
>     nativeMethod = false;
>    }, {
>     className = java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
>     fileName = AbstractQueuedSynchronizer.java;
>     lineNumber = 1987;
>     methodName = await;
>     nativeMethod = false;
>    }, {
>     className = java.util.concurrent.LinkedBlockingQueue;
>     fileName = LinkedBlockingQueue.java;
>     lineNumber = 399;
>     methodName = take;
>     nativeMethod = false;
>    }, {
>     className = org.apache.cassandra.net.OutboundTcpConnection;
>     fileName = OutboundTcpConnection.java;
>     lineNumber = 104;
>     methodName = run;
>     nativeMethod = false;
>    } ];
> ----------
>
>
>
> On Mon, Apr 29, 2013 at 4:31 PM, aaron morton <aaron@thelastpickle.com> wrote:
>
>> I used JMX to check current number of threads in a production cassandra
>> machine, and it was ~27,000.
>>
>> That does not sound too good.
>>
>> My first guess would be lots of client connections.  What client are you
>> using, and does it do connection pooling?
>> See the comments in cassandra.yaml around rpc_server_type; the default,
>> sync, uses one thread per connection, and you may be better off with HSHA.
>> But if your app is leaking connections you should probably deal with that
>> first.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
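For reference, the rpc_server_type switch mentioned just above is a single
line in cassandra.yaml; this is only a sketch, and the comments in the
packaged yaml for your version are the authority on whether hsha is a good
fit for your client:

# cassandra.yaml
# sync: one thread per client connection (the default)
# hsha: half synchronous, half asynchronous; a fixed pool of worker threads
rpc_server_type: hsha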
>> On 30/04/2013, at 3:07 AM, William Oberman <oberman@civicscience.com> wrote:
>>
>> Hi,
>>
>> I'm having some issues.  I keep getting:
>> ------------
>> ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java
>> (line 135) Exception in thread Thread[GossipStage:1,5,main]
>> java.lang.OutOfMemoryError: unable to create new native thread
>> --------------
>> after a day or two of runtime.  I've checked and my system settings seem
>> acceptable:
>> memlock=unlimited
>> nofiles=100000
>> nproc=122944
>>
>> I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS),
>> and I keep OOM'ing with the above error.
>>
>> I've found some (what seem to me) to be obscure references to the stack
>> size interacting with # of threads.  If I'm understanding it correctly, to
>> reason about Java mem usage I have to think of OS + Heap as being locked
>> down, and the stack gets the "leftovers" of physical memory and each thread
>> gets a stack.
>>
>> For me, the system ulimit setting on stack is 10240k (no idea if java
>> sees or respects this setting).  My -Xss for cassandra is the default (I
>> hope, don't remember messing with it) of 180k.  I used JMX to check current
>> number of threads in a production cassandra machine, and it was ~27,000.
>> Is that a normal thread count?  Could my OOM be related to stack + number
>> of threads, or am I overlooking something more simple?
>>
>> will
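As a rough sanity check on the stack-size question, the numbers quoted above
can be plugged into a few lines of Java. This is a back-of-the-envelope
sketch only: the 27,000 and 180k figures are the ones reported in this
thread, and real native memory use also depends on ulimits and per-thread
overheads.

import java.lang.management.ManagementFactory;

public class StackEstimate {
    public static void main(String[] args) {
        // Thread count of this JVM; for Cassandra, read the same value over JMX
        // (MBean java.lang:type=Threading, attribute ThreadCount) instead.
        int localThreads = ManagementFactory.getThreadMXBean().getThreadCount();

        int reportedThreads = 27000;   // figure reported above via JMX
        double stackKb = 180;          // -Xss180k, the stated default
        double stackGb = reportedThreads * stackKb / (1024.0 * 1024.0);

        // ~27,000 threads * 180k = ~4.6 GB reserved for thread stacks alone, on
        // top of a 6-12 GB heap on a 15 GB m1.xlarge, which is consistent with
        // "unable to create new native thread" once the leak piles up.
        System.out.printf("this JVM: %d threads; reported node: ~%.1f GB in thread stacks%n",
                localThreads, stackGb);
    }
}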