Subject: Re: Cassandra crashed - possible JMX threads leak
From: Bill Au <bill.w.au@gmail.com>
To: user@cassandra.apache.org
Date: Fri, 22 Oct 2010 15:49:27 -0400

Not with the nodeprobe or nodetool command, because the JVM these two commands spawn has a very short life span.

I am using a webapp to monitor my Cassandra cluster. It uses pretty much the same code as the NodeCmd class. For each incoming request it creates a NodeProbe object and uses it to get various status information about the cluster. I can reproduce the Cassandra JVM crash by issuing requests to this webapp in a bash while loop. I took a deeper look and here is what I discovered:

In the webapp, when NodeProbe creates a JMXConnector to connect to the Cassandra JMX port, a thread (com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker) is created and runs in the webapp's JVM.
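To make that per-request pattern concrete, here is roughly what every request ends up doing under the hood. This is only a minimal sketch written against the plain javax.management.remote API rather than the actual NodeProbe code; the class name, method name, host, and port are made up for illustration:

    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Sketch of the per-request pattern: connect, query, never close.
    public class PerRequestProbe {
        public static MBeanServerConnection connect(String host, int port) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
            // connect() is where the client-side
            // com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker
            // thread gets started in the webapp's JVM.
            JMXConnector connector = JMXConnectorFactory.connect(url, null);
            // The connector is never closed, so that checker thread (and the
            // connection it keeps open) outlives the request.
            return connector.getMBeanServerConnection();
        }
    }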
Meanwhile, in the Cassandra JVM there is a com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout thread to time out the remote JMX connection. However, since NodeProbe does not call JMXConnector.close(), the JMX client checker thread remains in the webapp's JVM even after the NodeProbe object has been garbage collected. So this JMX connection is still considered open, and that keeps the JMX timeout thread running inside the Cassandra JVM. The number of JMX client checker threads in my webapp's JVM matches the number of JMX server timeout threads in my Cassandra JVM. If I stop my webapp's JVM, all the JMX server timeout threads in my Cassandra JVM disappear after 2 minutes, the default timeout for a JMX connection.

This is why the problem cannot be reproduced with nodeprobe or nodetool. Even though JMXConnector.close() is not called, their JVM exits shortly afterwards, so the JMX client checker threads do not stay around, and the corresponding JMX server timeout threads go away after two minutes. That is not the case with my webapp: its JVM keeps running, so all the JMX client checker threads keep running as well. The threads keep piling up until they crash Cassandra's JVM.

In my case I think I can change my webapp to use a static NodeProbe instead of creating a new one for every request. That should get around the leak.
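Something along these lines is what I have in mind: one shared connection for the life of the webapp, so there is a single client checker thread here and a single timeout thread on the Cassandra side no matter how many requests are served. Again this is only a sketch against the plain JMX API rather than NodeProbe itself, and the host and port constants are placeholders:

    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Sketch of the workaround: create the JMX connection once and reuse it.
    public final class SharedProbeConnection {
        private static final String HOST = "cassandra-host"; // placeholder
        private static final int PORT = 8080;                // placeholder JMX port

        private static JMXConnector connector;

        public static synchronized MBeanServerConnection get() throws Exception {
            if (connector == null) {
                JMXServiceURL url = new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://" + HOST + ":" + PORT + "/jmxrmi");
                connector = JMXConnectorFactory.connect(url, null);
            }
            return connector.getMBeanServerConnection();
        }

        // Close on webapp shutdown so the matching server-side timeout thread
        // in Cassandra can be cleaned up rather than lingering.
        public static synchronized void close() throws Exception {
            if (connector != null) {
                try {
                    connector.close();
                } finally {
                    connector = null;
                }
            }
        }
    }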
However, I have seen the leak occur in another situation. On more than one occasion, when I restarted one node in a live multi-node cluster, I saw the JMX server timeout threads quickly pile up (numbering in the thousands) in Cassandra's JVM. It only happened on a live cluster that was servicing read and write requests. I am guessing hinted handoff might have something to do with it. I am still trying to understand what is happening there.

Bill

On Wed, Oct 20, 2010 at 5:16 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
> can you reproduce this by, say, running nodeprobe ring in a bash while
> loop?
>
> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bill.w.au@gmail.com> wrote:
> > One of my Cassandra servers crashed with the following:
> >
> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
> > CassandraDaemon.java (line 82) Uncaught exception in thread
> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
> > java.lang.OutOfMemoryError: unable to create new native thread
> >         at java.lang.Thread.start0(Native Method)
> >         at java.lang.Thread.start(Thread.java:597)
> >         at org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
> >
> > I took thread dumps in the JVM on all the other Cassandra servers in my
> > cluster. They all have thousands of threads looking like this:
> >
> > "JMX server connection timeout 183373" daemon prio=10 tid=0x00002aad230db800
> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >         at java.lang.Object.wait(Native Method)
> >         at com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
> >         - locked <0x00002aab056ccee0> (a [I)
> >         at java.lang.Thread.run(Thread.java:619)
> >
> > It seems to me that there is a JMX threads leak in Cassandra. NodeProbe
> > creates a JMXConnector but never calls its close() method. I tried setting
> > jmx.remote.x.server.connection.timeout to 0, hoping that would disable the
> > JMX server connection timeout threads, but that did not make any difference.
> >
> > Has anyone else seen this?
> >
> > Bill

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com