Return-Path: Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: (qmail 26277 invoked from network); 21 Oct 2010 05:54:19 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 21 Oct 2010 05:54:19 -0000 Received: (qmail 93700 invoked by uid 500); 21 Oct 2010 05:54:18 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 93661 invoked by uid 500); 21 Oct 2010 05:54:18 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 93652 invoked by uid 99); 21 Oct 2010 05:54:17 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 21 Oct 2010 05:54:17 +0000 X-ASF-Spam-Status: No, hits=2.9 required=10.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [209.85.216.172] (HELO mail-qy0-f172.google.com) (209.85.216.172) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 21 Oct 2010 05:54:11 +0000 Received: by qyk1 with SMTP id 1so333604qyk.10 for ; Wed, 20 Oct 2010 22:53:50 -0700 (PDT) MIME-Version: 1.0 Received: by 10.229.214.194 with SMTP id hb2mr339898qcb.234.1287640428170; Wed, 20 Oct 2010 22:53:48 -0700 (PDT) Received: by 10.229.24.5 with HTTP; Wed, 20 Oct 2010 22:53:48 -0700 (PDT) In-Reply-To: <2ef6ae54-043b-e3e8-9c39-805f409c101f@me.com> References: <2ef6ae54-043b-e3e8-9c39-805f409c101f@me.com> Date: Wed, 20 Oct 2010 23:53:48 -0600 Message-ID: Subject: Re: Cassandra crashed - possible JMX threads leak From: Frank LoVecchio To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=0016363108ed51d02304931a2667 X-Virus-Checked: Checked by ClamAV on apache.org --0016363108ed51d02304931a2667 Content-Type: text/plain; charset=ISO-8859-1 Ah, missed that. Thanks Aaron. On Wed, Oct 20, 2010 at 11:49 PM, Aaron Morton wrote: > Sounds like the problem discussed here > http://wiki.apache.org/cassandra/Operations?highlight=(allocate)|(memory) > > > If > you have the JNA jar it should work > http://www.mail-archive.com/user@cassandra.apache.org/msg06284.html > > http://www.riptano.com/blog/whats-new-cassandra-065 > > Aaron > > > On 21 Oct, 2010,at 06:29 PM, Frank LoVecchio wrote: > > I have a cluster of 3 0.7 beta 2 nodes (built today from the latest trunk) > running on Large, EBS-backed, x64 EC2 instances; RF=3. I attempted to write > somewhere near 500,000 records every 15 minutes from a total of 5 different > computers (using Pelops and multi-threading). Though my network blew up > and I'm not quite sure how many records were inserted, I lost a node a > couple hours later, and the other 2 were at severely high memory useage. Is > this a memory leak of some kind, or something I can configure / watch for in > the future? > > A nodetool does this: > > [ec2-user@XXX bin]$ ./nodetool -h localhost ring > Address Status State Load Token > > XXX > ipXXX Down Normal 564.76 MB XXX > ipXXX Up Normal 564.83 MB XXX > ipXXX Up Normal 563.06 MB XXX > > A top on the box that is down shows this: (dual core x64) > > Cpu(s): 19.9%us, 5.9%sy, 0.0%ni, 8.8%id, 57.3%wa, 0.0%hi, 0.0%si, > 8.1%st > Mem: 7651528k total, 7611112k used, 40416k free, 66056k buffers > Swap: 0k total, 0k used, 0k free, 3294076k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > 22514 root 20 0 5790m 4.0g 167m S 91.9 54.8 152:45.08 java > > > I see this error in the log file: > > ERROR [CompactionExecutor:1] 2010-10-21 01:35:05,318 > AbstractCassandraDaemon.java (line 88) Fatal exception in thread > Thread[CompactionExecutor:1,1,main] > java.io.IOError: java.io.IOException: Cannot run program "ln": > java.io.IOException: error=12, Cannot allocate memory > at > org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1368) > at org.apache.cassandra.db.Table.snapshot(Table.java:163) > at > org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:232) > at > org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:106) > at > org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:84) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:636) > Caused by: java.io.IOException: Cannot run program "ln": > java.io.IOException: error=12, Cannot allocate memory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:475) > at > org.apache.cassandra.io.util.FileUtils.createHardLinkWithExec(FileUtils.java:263) > at > org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:229) > at > org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1360) > ... 9 more > Caused by: java.io.IOException: java.io.IOException: error=12, Cannot > allocate memory > at java.lang.UNIXProcess.(UNIXProcess.java:164) > at java.lang.ProcessImpl.start(ProcessImpl.java:81) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:468) > ... 12 more > > > On Wed, Oct 20, 2010 at 3:16 PM, Jonathan Ellis wrote: > >> can you reproduce this by, say, running nodeprobe ring in a bash while >> loop? >> >> >> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au wrote: >> > One of my Cassandra server crashed with the following: >> > >> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnnnnn.nnn] 2010-10-19 00:25:10,419 >> >> > CassandraDaemon.java (line 82) Uncaught exception in thread >> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main] >> > java.lang.OutOfMemoryError: unable to create new native thread >> > at java.lang.Thread.start0(Native Method) >> > at java.lang.Thread.start(Thread.java:597) >> > at >> > >> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533) >> > >> > >> > I took threads dump in the JVM on all the other Cassandra severs in my >> > cluster. They all have thousand of threads looking like this: >> > >> > "JMX server connection timeout 183373" daemon prio=10 >> tid=0x00002aad230db800 >> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000] >> > java.lang.Thread.State: TIMED_WAITING (on object monitor) >> > at java.lang.Object.wait(Native Method) >> > at >> > >> com.sunjmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150) >> >> > - locked <0x00002aab056ccee0> (a [I) >> > at java.lang.Thread.run(Thread.java:619) >> > >> > It seems to me that there is a JMX threads leak in Cassandra. NodeProbe >> > creates a JMXConnector but never calls its close() method. I tried >> setting >> > jmx.remote.x.server.connection.timeout to 0 hoping that would disable >> the >> > JMX server connection timeout threads. But that did not make any >> > difference. >> > >> > Has anyone else seen this? >> > >> > Bill >> > >> >> >> >> -- >> Jonathan Ellis >> Project Chair, Apache Cassandra >> co-founder of Riptano, the source for professional Cassandra support >> http://riptano.com >> > > --0016363108ed51d02304931a2667 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Ah, missed that.

Thanks Aaron.

On Wed, Oct 20, 2010 at 11:49 PM, Aaron Morton &= lt;aaron@thelastpickle.com&g= t; wrote:

If you have the JN= A jar it should work=A0

Aaron


On 21 Oct, 2010,at 06:29 PM, Frank LoVecchio <frank@isidorey.com> wrot= e:

<= div class=3D"h5">I have a cluster of 3 0.7 beta 2 nodes (built today from t= he latest trunk) running on Large, EBS-backed, x64 EC2 instances; RF=3D3. = =A0I attempted to write somewhere near 500,000 records every 15 minutes fro= m a total of 5 different computers (using Pelops and multi-threading). =A0 = Though my network blew up and I'm not quite sure how many records were = inserted, I lost a node a couple hours later, and the other 2 were at sever= ely high memory useage. =A0Is this a memory leak of some kind, or something= I can configure / watch for in the future?

A nodetool does this:

[ec2-user@XXX bin]= $ ./nodetool -h localhost ring
=A0=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0XXX=A0=A0
ipXXX =A0 Down =A0 Normal =A0564.7= 6 MB =A0 =A0 =A0 XXX=A0=A0 =A0 =A0
ipXXX =A0 Up =A0 =A0 =A0 Normal =A0564.83 MB =A0 =A0 =A0 XXX=A0=A0 =A0=A0
i= pXXX =A0 Up =A0 =A0 =A0 Normal =A0563.06 MB =A0 =A0 =A0 XXX=A0=A0 =A0=A0

A top on the box that is down shows this: (dual core x64)
Cpu(s): 19.9%us, =A05.9%sy, =A00.0%ni, =A08.8%id, 57.3%wa, =A00.0%hi, = =A00.0%si, =A08.1%st
Mem: =A0 7651528k total, =A07611112k used, =A0 =A040416k free, =A0 =A0= 66056k buffers
Swap: =A0 =A0 =A0 =A00k total, =A0 =A0 =A0 =A00k u= sed, =A0 =A0 =A0 =A00k free, =A03294076k cached

PI= D USER =A0 =A0 =A0PR =A0NI =A0VIRT =A0RES =A0SHR S %CPU %MEM =A0 =A0TIME+ = =A0COMMAND =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0
22514 root =A0 =A0 =A020 =A0 0 5790m 4.0g 167m S =A0 =A091.9 =A0 =A0 = =A0 =A054.8 152:45.08 java =A0 =A0 =A0 =A0 =A0=A0
I see this error = in the log file:

ERROR [CompactionExecutor:1] 2010-10-21 01:35:05,318 AbstractCassandra= Daemon.java (line 88) Fatal exception in thread Thread[CompactionExecutor:1= ,1,main]
java.io.IOError: java.io.IOException: Cannot run program "ln"= ;: java.io.IOException: error=3D12, Cannot allocate memory
at org.apache.cassandra.db.ColumnFam= ilyStore.snapshot(ColumnFamilyStore.java:1368)
at org.apache.cassandra.d= b.Table.snapshot(Table.java:163)
at org.apache.cassandra.db.CompactionManager.doCompaction(Comp= actionManager.java:232)
at org.apache.cassandra.d= b.CompactionManager$1.call(CompactionManager.java:106)
at org.apache.cassandra.db.CompactionMan= ager$1.call(CompactionManager.java:84)
at java.util.concurrent.F= utureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.= java:166)
at java.util.concurrent.T= hreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExe= cutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(T= hread.java:636)
Caused by: java.io.IOException: Cannot run progra= m "ln": java.io.IOException: error=3D12, Cannot allocate memory
at java.lang.ProcessBuild= er.start(ProcessBuilder.java:475)
at org.apache.cassandra.io.util.FileUtils.createHardLinkWithE= xec(FileUtils.java:263)
at org.apache.cassandra.i= o.util.FileUtils.createHardLink(FileUtils.java:229)
at org.apache.cassandra.db.ColumnFamilySt= ore.snapshot(ColumnFamilyStore.java:1360)
... 9 more
Caus= ed by: java.io.IOException: java.io.IOException: error=3D12, Cannot allocat= e memory
at java.lan= g.UNIXProcess.<init>(UNIXProcess.java:164)
at java.lang.ProcessImpl.= start(ProcessImpl.java:81)
= at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
... 12 more
=A0

On Wed, Oct 20, 2010 at 3:16 PM, Jonathan Ellis &= lt;jbellis@gmail.com= > wrote:
can you reproduce this by, say, running nodeprobe ring in a bash while lo= op?


On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bill.w.au@gmail.com> wrote:
> One of my Cassandra server crashed with the following:
>
> ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnnnnn.nnn] 2010-10-19 00:25:10,419

> CassandraDaemon.java (line 82) Uncaught exception in thread
> Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
> java.lang.OutOfMemoryError: unable to create new native thread
> =A0=A0=A0=A0=A0=A0=A0 at java.lang.Thread.start0(Native Method)
> =A0=A0=A0=A0=A0=A0=A0 at java.lang.Thread.start(Thread.java:597)
> =A0=A0=A0=A0=A0=A0=A0 at
> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingSe= rvice.java:533)
>
>
> I took threads dump in the JVM on all the other Cassandra severs in my=
> cluster.=A0 They all have thousand of threads looking like this:
>
> "JMX server connection timeout 183373" daemon prio=3D10 tid= =3D0x00002aad230db800
> nid=3D0x5cf6 in Object.wait() [0x00002aad7a316000]
> =A0=A0 java.lang.Thread.State: TIMED_WAITING (on object monitor)
> =A0=A0=A0=A0=A0=A0=A0 at java.lang.Object.wait(Native Method)
> =A0=A0=A0=A0=A0=A0=A0 at
> com.sunjmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerC= ommunicatorAdmin.java:150)

> =A0=A0=A0=A0=A0=A0=A0 - locked <0x00002aab056ccee0> (a [I)
> =A0=A0=A0=A0=A0=A0=A0 at java.lang.Thread.run(Thread.java:619)
>
> It seems to me that there is a JMX threads leak in Cassandra.=A0 NodeP= robe
> creates a JMXConnector but never calls its close() method.=A0 I tried = setting
> jmx.remote.x.server.connection.timeout to 0 hoping that would disable = the
> JMX server connection timeout threads.=A0 But that did not make any > difference.
>
> Has anyone else seen this?
>
> Bill
>



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


--0016363108ed51d02304931a2667--