Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E607F9E65 for ; Fri, 11 May 2012 16:03:18 +0000 (UTC) Received: (qmail 55122 invoked by uid 500); 11 May 2012 16:03:16 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 55096 invoked by uid 500); 11 May 2012 16:03:16 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 55088 invoked by uid 99); 11 May 2012 16:03:16 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 11 May 2012 16:03:16 +0000 X-ASF-Spam-Status: No, hits=2.5 required=5.0 tests=FREEMAIL_REPLY,HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of madalinaimatei@gmail.com designates 74.125.82.172 as permitted sender) Received: from [74.125.82.172] (HELO mail-we0-f172.google.com) (74.125.82.172) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 11 May 2012 16:03:08 +0000 Received: by werf13 with SMTP id f13so1513606wer.31 for ; Fri, 11 May 2012 09:02:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:mime-version:content-type:subject:date:in-reply-to:to :references:message-id:x-mailer; bh=PGj9js1dhjaiDHCeUKSd2hAwQAb6lNkM4yjRmjk0i4M=; b=tzxOhaVWWY1eQpaoZiQM5godGRwIObxesBy82Y7tYWqA0Ao+T0GdSJ0HPnqEq2RpT8 U7D0f+WrpljCo9N/SDYgU20D7dwvwBSp1bktFS4tDohK7JWHW9cO2dlMJlfBXyTkgeqN bnE6TzEa8liDpX7a2XWC47vFvfT6py4RGYZUhjNZedCYEC72VDDGIa6uHFiDov+wzrO/ nLf74UdhoewhXAL1EE1HKFgaVZCof6ypiCVKK/jsXNlQSib+4+mOd6aJDtjt1/BBg2h+ hzLCT3J3sleFiSZskz3tcppbZWbqaswwjoZ5txo4MaKn3dXHn1vsSuuiSzOc7gdcspYS lrwA== Received: by 10.180.85.69 with SMTP id f5mr9000539wiz.18.1336752168383; Fri, 11 May 2012 09:02:48 -0700 (PDT) Received: from [10.20.12.94] ([80.71.29.65]) by mx.google.com with ESMTPS id j3sm18436573wiw.1.2012.05.11.09.02.47 (version=TLSv1/SSLv3 cipher=OTHER); Fri, 11 May 2012 09:02:47 -0700 (PDT) From: Madalina Matei Mime-Version: 1.0 (Apple Message framework v1257) Content-Type: multipart/alternative; boundary="Apple-Mail=_661D4DB8-E960-4251-AC69-894EFA341057" Subject: Re: Cassandra stucks Date: Fri, 11 May 2012 17:03:07 +0100 In-Reply-To: <3CC01B771AB0E149A76FEA4FCAC98226159BBA2F@smc-ex1.Enkata.com> To: user@cassandra.apache.org References: <3CC01B771AB0E149A76FEA4FCAC98226159BB9E9@smc-ex1.Enkata.com> <3CC01B771AB0E149A76FEA4FCAC98226159BBA09@smc-ex1.Enkata.com> <3CC01B771AB0E149A76FEA4FCAC98226159BBA2F@smc-ex1.Enkata.com> Message-Id: X-Mailer: Apple Mail (2.1257) --Apple-Mail=_661D4DB8-E960-4251-AC69-894EFA341057 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii Check your JMX port in cassandra-env.sh and see if that's open.=20 Also if you have enabled=20 JVM_OPTS=3D"$JVM_OPTS -Djava.rmi.server.hostname=3D" and you are using an ip address for -Djava.rmi.server.hostname make sure = that is the correct ip. On 11 May 2012, at 16:42, Pavel Polushkin wrote: > No We are using dedicated phisical hardware. Currently we have 5 = nodes. > =20 > From: Madalina Matei [mailto:madalinaimatei@gmail.com]=20 > Sent: Friday, May 11, 2012 19:40 > To: user@cassandra.apache.org > Subject: Re: Cassandra stucks > =20 > Are you using EC2 ? > =20 > On 11 May 2012, at 16:13, Pavel Polushkin wrote: >=20 >=20 > We use 1.0.8 version. > =20 > From: David Leimbach [mailto:leimy2k@gmail.com]=20 > Sent: Friday, May 11, 2012 18:48 > To: user@cassandra.apache.org > Subject: Re: Cassandra stucks > =20 > What's the version number of Cassandra? >=20 > On Fri, May 11, 2012 at 7:38 AM, Pavel Polushkin = wrote: > Hello, >=20 > =20 >=20 > We faced with a strange problem while testing performance on Cassandra = cluster. After some time all nodes went to down state for several days. = Now all nodes went back to up state and only one node still down. >=20 > =20 >=20 > Nodetool on down node throws exception: >=20 > Error connection to remote JMX agent! >=20 > java.io.IOException: Failed to retrieve RMIServer stub: = javax.naming.CommunicationException [Root exception is = java.rmi.ConnectIOException: error during JRMP connection establishment; = nested exception is: >=20 > java.net.SocketTimeoutException: Read timed out] >=20 > at = javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340) >=20 > at = javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.ja= va:248) >=20 > at = org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144) >=20 > at = org.apache.cassandra.tools.NodeProbe.(NodeProbe.java:114) >=20 > at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623) >=20 > Caused by: javax.naming.CommunicationException [Root exception is = java.rmi.ConnectIOException: error during JRMP connection establishment; = nested exception is: >=20 > java.net.SocketTimeoutException: Read timed out] >=20 > at = com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)= >=20 > at = com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:1= 85) >=20 > at javax.naming.InitialContext.lookup(InitialContext.java:392) >=20 > at = javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.ja= va:1888) >=20 > at = javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1= 858) >=20 > at = javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257) >=20 > ... 4 more >=20 > Caused by: java.rmi.ConnectIOException: error during JRMP connection = establishment; nested exception is: >=20 > java.net.SocketTimeoutException: Read timed out >=20 > at = sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286) >=20 > at = sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184) >=20 > at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322) >=20 > at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source) >=20 > at = com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97) >=20 > ... 9 more >=20 > Caused by: java.net.SocketTimeoutException: Read timed out >=20 > at java.net.SocketInputStream.socketRead0(Native Method) >=20 > at java.net.SocketInputStream.read(SocketInputStream.java:129) >=20 > at = java.io.BufferedInputStream.fill(BufferedInputStream.java:218) >=20 > at = java.io.BufferedInputStream.read(BufferedInputStream.java:237) >=20 > at java.io.DataInputStream.readByte(DataInputStream.java:248) >=20 > at = sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228) >=20 > ... 13 more >=20 > =20 >=20 > In system log of down node unlimited list of such errors: >=20 > INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804) = InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10 = 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now = UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) = InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10 = 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now = UP INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) = InetAddress /172.15.2.161 is now dead. >=20 > INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) = InetAddress /172.15.2.165 is now dead. >=20 > INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) = InetAddress /172.15.2.162 is now dead. >=20 > INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) = InetAddress /172.15.2.163 is now dead. >=20 > INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804) = InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10 = 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now = UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) = InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10 = 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now = UP >=20 > =20 >=20 > The suspicious fact is that on this node we have several tcp = connections to other nodes 7000 port in CLOSE_WAIT state: >=20 > Active Internet connections (servers and established) >=20 > Proto Recv-Q Send-Q Local Address Foreign Address = State >=20 > tcp 869073 0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274 = CLOSE_WAIT >=20 > tcp 463429 0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654 = CLOSE_WAIT >=20 > tcp 873838 0 rcwocas:afs3-fileserver rcwocas01.enkata.:49486 = CLOSE_WAIT >=20 > tcp 860245 0 rcwocas:afs3-fileserver rcwocas05.enkata.:43028 = CLOSE_WAIT >=20 > tcp 112 0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321 = CLOSE_WAIT >=20 > tcp 2124 0 rcwocas:afs3-fileserver rcwocas03.enkata.:39338 = CLOSE_WAIT >=20 > tcp 0 0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408 = ESTABLISHED >=20 > tcp 184 0 rcwocas:afs3-fileserver rcwocas01.enkata.:48862 = CLOSE_WAIT >=20 > tcp 534489 0 rcwocas:afs3-fileserver rcwocas02.enkata.:35331 = ESTABLISHED >=20 > tcp 886 0 rcwocas:afs3-fileserver rcwocas03.enkata.:56034 = CLOSE_WAIT >=20 > tcp 0 0 rcwocas04.Enkata.:48800 rcwocas:afs3-fileserver = ESTABLISHED >=20 > tcp 0 0 rcwocas:afs3-fileserver rcwocas01.enkata.:51348 = ESTABLISHED >=20 > tcp 187 0 rcwocas:afs3-fileserver rcwocas05.enkata.:45538 = CLOSE_WAIT >=20 > tcp 253 0 rcwocas:afs3-fileserver rcwocas03.enkata.:51359 = CLOSE_WAIT >=20 > =20 >=20 > Also I have attached thread dump >=20 > =20 >=20 > Thanks, >=20 > Pavel >=20 > =20 > =20 > =20 --Apple-Mail=_661D4DB8-E960-4251-AC69-894EFA341057 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=us-ascii Check your JMX port in cassandra-env.sh and see if = that's open. 

Also if you have = enabled 

 JVM_OPTS=3D"$JVM_OPTS = -Djava.rmi.server.hostname=3D<public = name>"

and you are using an ip address = for -Djava.rmi.server.hostname make sure that is the correct = ip.


On 11 May 2012, at = 16:42, Pavel Polushkin wrote:

No We are using dedicated phisical hardware. = Currently we have 5 nodes.
From: Madalina Matei = [mailto:madalinaimatei@gmail.com] 
Sent: Friday, May 11, 2012 = 19:40
To:  
Re: Cassandra = stucks
What's the version number of = Cassandra?

On Fri, May 11, 2012 = at 7:38 AM, Pavel Polushkin <Hello,

 

We faced with a strange problem while testing performance on = Cassandra cluster. After some time all nodes went to down state for = several days. Now all nodes went back to up state and only one node = still down.

Nodetool on = down node throws exception:

Error connection to = remote JMX agent!

java.io.IOException: = Failed to retrieve RMIServer stub: javax.naming.CommunicationException = [Root exception is java.rmi.ConnectIOException: error during JRMP = connection establishment; nested exception is:

        = java.net.SocketTimeoutException: Read timed out]

        at = javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)

        at = org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)

Caused by: javax.naming.CommunicationException [Root = exception is java.rmi.ConnectIOException: error during JRMP connection = establishment; nested exception is:

        = java.net.SocketTimeoutException: Read timed out]

        at = com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)=

        at = javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.ja= va:1888)

Caused by: = java.rmi.ConnectIOException: error during JRMP connection establishment; = nested exception is:

        at = sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)

        at = sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown = Source)

Caused by: = java.net.SocketTimeoutException: Read timed out

        at = java.net.SocketInputStream.socketRead0(Native = Method)

        at = java.io.BufferedInputStream.fill(BufferedInputStream.java:218)=

        at = java.io.BufferedInputStream.read(BufferedInputStream.java:237)=

        at = java.io.DataInputStream.readByte(DataInputStream.java:248)

        at = sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)

In system log = of down node unlimited list of such errors:

INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java = (line 804) InetAddress / is = now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line = 804) InetAddress / is = now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line = 804) InetAddress / is = now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line = 804) InetAddress / is = now UP INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line = 818) InetAddress / is = now dead.

INFO [GossipTasks:1] = 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.165 is now = dead.

INFO [GossipTasks:1] 2012-05-10 = 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.162 is now = dead.

INFO [GossipTasks:1] 2012-05-10 = 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.163 is now = dead.

INFO [GossipStage:1] 2012-05-10 = 23:18:29,291 Gossiper.java (line 804) InetAddress /172.15.2.161 is now UP INFO = [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) = InetAddress / is = now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line = 804) InetAddress / is = now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line = 804) InetAddress / is = now UP

The suspicious = fact is that on this node we have several tcp connections to other nodes = 7000 port in CLOSE_WAIT state:

Active Internet = connections (servers and established)

Proto Recv-Q Send-Q Local = Address           = Foreign Address         = State

tcp   = 869073      0 rcwocas:afs3-fileserver = rcwocas03.enkata.:34274 CLOSE_WAIT

tcp   463429      0 = rcwocas:afs3-fileserver rcwocas02.enkata.:39654 = CLOSE_WAIT

tcp   = 873838      0 rcwocas:afs3-fileserver = rcwocas01.enkata.:49486 CLOSE_WAIT

tcp   860245      0 = rcwocas:afs3-fileserver rcwocas05.enkata.:43028 = CLOSE_WAIT

tcp      = 184      0 rcwocas:afs3-fileserver = rcwocas01.enkata.:48862 CLOSE_WAIT

tcp   534489      0 = rcwocas:afs3-fileserver rcwocas02.enkata.:35331 = ESTABLISHED

tcp        = 0      0 rcwocas:afs3-fileserver = rcwocas01.enkata.:51348 ESTABLISHED

tcp      = 187      0 rcwocas:afs3-fileserver = rcwocas05.enkata.:45538 CLOSE_WAIT

tcp      = 253      0 rcwocas:afs3-fileserver = rcwocas03.enkata.:51359 CLOSE_WAIT

 

Also I have attached thread dump

 

Thanks,

Pavel


= --Apple-Mail=_661D4DB8-E960-4251-AC69-894EFA341057--