incubator-cassandra-user mailing list archives

From William Oberman <ober...@civicscience.com>
Subject Re: normal thread counts?
Date Wed, 01 May 2013 21:22:01 GMT
That has GOT to be it.  1.1.10 upgrade it is...


On Wed, May 1, 2013 at 5:09 PM, Janne Jalkanen <Janne.Jalkanen@ecyrd.com> wrote:

>
> This sounds very much like
> https://issues.apache.org/jira/browse/CASSANDRA-5175, which was fixed in
> 1.1.10.
>
> /Janne
>
> On Apr 30, 2013, at 23:34 , aaron morton <aaron@thelastpickle.com> wrote:
>
>  Many many many of the threads are trying to talk to IPs that aren't in
> the cluster (I assume they are the IP's of dead hosts).
>
> Are these IPs from before the upgrade? Are they IPs you expect to see?
>
> Cross reference them with the output from nodetool gossipinfo to see why
> the node thinks they should be used.
> Could you provide a list of the thread names ?
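
The cross-referencing suggested above can be sketched roughly as follows. This is only an illustration: `cross_reference` is a hypothetical helper, and both IP lists are made-up sample values. In practice the first would be extracted from the thread names and the second copied by hand from the endpoints that nodetool gossipinfo reports.

```python
def cross_reference(thread_ips, gossip_ips):
    """Split the IPs seen in thread names by whether gossip knows about them."""
    seen = set(thread_ips)
    known = set(gossip_ips)
    return {
        # Gossip still holds state for these, which may explain why the node uses them.
        "in_gossip": sorted(seen & known),
        # Threads exist for these, but gossip has no record of them.
        "not_in_gossip": sorted(seen - known),
    }


# Hypothetical sample data (not taken from the cluster discussed in this thread).
thread_ips = ["10.0.0.1", "10.0.0.7", "10.0.0.9"]
gossip_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.7"]

result = cross_reference(thread_ips, gossip_ips)
print(result)
```

IPs that land in `in_gossip` but belong to hosts that no longer exist would be the first ones to look up in the gossipinfo output.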
>
> One way to remove those IPs may be a rolling restart with
> -Dcassandra.load_ring_state=false in the JVM opts at the bottom of
> cassandra-env.sh
>
> The OutboundTcpConnection threads are created in pairs by the
> OutboundTcpConnectionPool, which is created here
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
> The threads are created in the OutboundTcpConnectionPool constructor. Checking
> to see if this could be the source of the leak.
>
> Cheers
>
>    -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 1/05/2013, at 2:18 AM, William Oberman <oberman@civicscience.com>
> wrote:
>
> I use phpcassa.
>
> I did a thread dump.  99% of the threads look very similar (I'm using
> 1.1.9 in terms of matching source lines).  The thread names are all like
> this: "WRITE-/10.x.y.z".  There are a LOT of duplicates (in terms of the
> same IP).  Many many many of the threads are trying to talk to IPs that
> aren't in the cluster (I assume they are the IP's of dead hosts).  The
> stack trace is basically the same for them all, attached at the bottom.
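
One way to quantify the duplicates described above is to count threads per peer IP. The following is a minimal sketch, assuming the thread names have already been pulled out of the dump into a list of strings. The helper name `count_write_threads` and the sample names are hypothetical; only the "WRITE-/<ip>" naming pattern comes from the message.

```python
import re
from collections import Counter

# Matches thread names of the form WRITE-/10.1.2.3, as described in the message.
WRITE_THREAD = re.compile(r"WRITE-/(\d{1,3}(?:\.\d{1,3}){3})")


def count_write_threads(thread_names):
    """Return a Counter mapping peer IP to the number of WRITE- threads for it."""
    counts = Counter()
    for name in thread_names:
        match = WRITE_THREAD.search(name)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample; real names would come from a thread dump or JMX.
sample = ["WRITE-/10.0.0.1", "WRITE-/10.0.0.1", "WRITE-/10.0.0.7", "GossipStage:1"]

for ip, n in count_write_threads(sample).most_common():
    print(ip, n)
```

Sorting by count puts the peers with the most duplicate threads first.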
>
> There are a lot of things I could talk about in terms of my situation, but
> what I think might be pertinent to this thread: I hit a "tipping point"
> recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge
> (rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of
> the 9 keep struggling.  I've replaced them many times now, each time using
> this process:
> http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
> And even this morning the only two nodes with a high number of threads are
> those two (yet again).  And at some point they'll OOM.
>
> Seems like there is something about my cluster (caused by the recent
> upgrade?) that causes a thread leak on OutboundTcpConnection.  But I don't
> know how to escape from the trap.  Any ideas?
>
>
> --------
>   stackTrace = [ {
>     className = sun.misc.Unsafe;
>     fileName = Unsafe.java;
>     lineNumber = -2;
>     methodName = park;
>     nativeMethod = true;
>    }, {
>     className = java.util.concurrent.locks.LockSupport;
>     fileName = LockSupport.java;
>     lineNumber = 158;
>     methodName = park;
>     nativeMethod = false;
>    }, {
>     className = java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
>     fileName = AbstractQueuedSynchronizer.java;
>     lineNumber = 1987;
>     methodName = await;
>     nativeMethod = false;
>    }, {
>     className = java.util.concurrent.LinkedBlockingQueue;
>     fileName = LinkedBlockingQueue.java;
>     lineNumber = 399;
>     methodName = take;
>     nativeMethod = false;
>    }, {
>     className = org.apache.cassandra.net.OutboundTcpConnection;
>     fileName = OutboundTcpConnection.java;
>     lineNumber = 104;
>     methodName = run;
>     nativeMethod = false;
>    } ];
> ----------
>
>
>
>
> On Mon, Apr 29, 2013 at 4:31 PM, aaron morton <aaron@thelastpickle.com> wrote:
>
>>  I used JMX to check current number of threads in a production cassandra
>> machine, and it was ~27,000.
>>
>> That does not sound too good.
>>
>> My first guess would be lots of client connections. What client are you
>> using, does it do connection pooling?
>> See the comments in cassandra.yaml around rpc_server_type. The default, sync,
>> uses one thread per connection; you may be better off with HSHA. But if your
>> app is leaking connections you should probably deal with that first.
>>
>> Cheers
>>
>>    -----------------
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 30/04/2013, at 3:07 AM, William Oberman <oberman@civicscience.com>
>> wrote:
>>
>> Hi,
>>
>> I'm having some issues.  I keep getting:
>> ------------
>> ERROR [GossipStage:1] 2013-04-28 07:48:48,876
>> AbstractCassandraDaemon.java (line 135) Exception in thread
>> Thread[GossipStage:1,5,main]
>> java.lang.OutOfMemoryError: unable to create new native thread
>> --------------
>> after a day or two of runtime.  I've checked and my system settings seem
>> acceptable:
>> memlock=unlimited
>> nofiles=100000
>> nproc=122944
>>
>> I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS),
>> and I keep OOM'ing with the above error.
>>
>> I've found some (what seem to me to be) obscure references to the stack
>> size interacting with # of threads.  If I'm understanding it correctly, to
>> reason about Java mem usage I have to think of OS + Heap as being locked
>> down, and the stack gets the "leftovers" of physical memory and each thread
>> gets a stack.
>>
>> For me, the system ulimit setting on stack is 10240k (no idea if java
>> sees or respects this setting).  My -Xss for cassandra is the default (I
>> hope, don't remember messing with it) of 180k.  I used JMX to check current
>> number of threads in a production cassandra machine, and it was ~27,000.
>>  Is that a normal thread count?  Could my OOM be related to stack + number
>> of threads, or am I overlooking something more simple?
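
The stack-versus-thread-count question above comes down to simple arithmetic, sketched here under stated assumptions: "180k" is read as 180 KiB per thread, every thread is assumed to reserve its full stack, and `stack_reservation_bytes` is a hypothetical helper. The result is an upper bound on reserved address space, not a measurement of memory actually in use.

```python
def stack_reservation_bytes(thread_count: int, stack_kib: int) -> int:
    """Upper bound on the memory reserved for thread stacks, in bytes."""
    return thread_count * stack_kib * 1024


# Figures quoted in the message; the unit interpretation is an assumption.
threads = 27_000
xss_kib = 180

total = stack_reservation_bytes(threads, xss_kib)
print(f"{total / 1024**3:.2f} GiB reserved for thread stacks")
```

Comparing that figure with what is left of the 15 GB of physical memory after the heap and the OS take their share gives a rough sense of whether stack space alone could be the constraint.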
>>
>> will
>>
>>
>>
>
>
>
>
>
>
