incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pavel Polushkin" <ppolush...@enkata.com>
Subject RE: Cassandra stucks
Date Mon, 14 May 2012 13:11:45 GMT
Thank you for your reply Aaron!

 

Here is more details about our environment:

1.      OS - Debian 6.0.4

root@rcwocas05:~# uname -a

Linux rcwocas05 2.6.32-5-amd64 #1 SMP Thu Mar 22 17:26:33 UTC 2012
x86_64 GNU/Linux

2.      Java

root@rcwocas05:~# java -version

java version "1.6.0_26"

Java(TM) SE Runtime Environment (build 1.6.0_26-b03)

Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

3.      Cassandra 

Datastax Enterprise - dse 2.0.1, built on Cassandra 1.0.8

 

Now I will try to answer your questions:

Q: Where are you running nodetool from ?

A: From the same node

 

Q: Do you have any switches or firewalls between the servers ? Could
they be closing connections?

A: All nodes are on the same switch without firewalls. We are using
nodetool -h localhost ring, also cluster configured by IP addresses, so
it's seems that DNS should not be the cause of this.

 

We didn't find network problems on cluster. Here is ifconfig that shows
that there is no error or dropped packets: 

root@rcwocas05:~# ifconfig

eth1      Link encap:Ethernet  HWaddr 00:25:90:63:16:c8

          inet addr:172.15.2.165  Bcast:172.15.2.255  Mask:255.255.255.0

          inet6 addr: fe80::225:90ff:fe63:16c8/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1

          RX packets:573854310 errors:0 dropped:0 overruns:0 frame:0

          TX packets:777260637 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:5649087010357 (5.1 TiB)  TX bytes:4024634337167 (3.6
TiB)

 

lo        Link encap:Local Loopback

          inet addr:127.0.0.1  Mask:255.0.0.0

          inet6 addr: ::1/128 Scope:Host

          UP LOOPBACK RUNNING  MTU:16436  Metric:1

          RX packets:926558 errors:0 dropped:0 overruns:0 frame:0

          TX packets:926558 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:220573098 (210.3 MiB)  TX bytes:220573098 (210.3 MiB)

 

Seems that network working fine, we tested ping then several nodes were
in down state and didn't find any connection errors.

 

Q: Anything interesting with your networking setup

A: We using standard Cassandra configuration. The only think we
configured is IP's of cluster nodes.

 

 

I would like to give more details:

 

1.      Attaching Cassandra node thread dump when it in down state. You
can find that all "network" threads are in WAITING state. Here is
example of such threads:

"Thrift:730" daemon prio=10 tid=0x00007f15bc050800 nid=0x30c5 waiting on
condition [0x00007f934e75b000]

   java.lang.Thread.State: WAITING (parking)

                at sun.misc.Unsafe.park(Native Method)

                - parking to wait for  <0x00000006810ec428> (a
java.util.concurrent.SynchronousQueue$TransferStack)

                at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

                at
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(Synchro
nousQueue.java:422)

                at
java.util.concurrent.SynchronousQueue$TransferStack.transfer(Synchronous
Queue.java:323)

                at
java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:857)

                at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:
947)

                at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja
va:907)

                at java.lang.Thread.run(Thread.java:662)

"WRITE-/172.15.2.162" daemon prio=10 tid=0x0000000000fc8800 nid=0x1bb4
waiting on condition [0x00007f934d082000]

   java.lang.Thread.State: WAITING (parking)

                at sun.misc.Unsafe.park(Native Method)

                - parking to wait for  <0x0000000681115318> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

                at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

                at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.aw
ait(AbstractQueuedSynchronizer.java:1987)

                at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:3
99)

                at
org.apache.cassandra.net.OutboundTcpConnection.take(OutboundTcpConnectio
n.java:198)

                at
org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection
.java:90)

...

 

Here is logs from two nodes: Node1 and node 5 when Node5 goes to down
state: Node 5 contains exceptions while Node 1 only diagnose that Node 5
went to down state.

 

NODE 5 LOG:
---
DEBUG [Thrift:873] 2012-05-14 05:44:03,066 StorageProxy.java (line212)
Write timeout java.util.concurrent.TimeoutException for one (or more)
of:
DEBUG [Thrift:873] 2012-05-14 05:44:03,066 CassandraServer.java
(line648) ... timed out
DEBUG [Thrift:873] 2012-05-14 05:44:03,066 CustomTThreadPoolServer.java
(line 260) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException: Cannot read. Remote
side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is
often indicative of an internal error on the server side. Please check
your server logs.)
       at
org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
       at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:
378)
       at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:
297)
       at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProto
col.java:204)
       at
org.apache.cassandra.thrift.Dse$Processor.process(Dse.java:908)
       at
com.datastax.bdp.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.
run(CustomTThreadPoolServer.java:250)
       at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecuto
r.java:886)
       at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja
va:908)
       at java.lang.Thread.run(Thread.java:662)
DEBUG [Thread-3010] 2012-05-14 05:44:12,007 Gossiper.java (line 222)
Reseting version for /172.15.2.164 <http://172.15.2.164/> 
 INFO [GossipTasks:1] 2012-05-14 05:44:12,135 Gossiper.java (line 818)
InetAddress /172.15.2.161 <http://172.15.2.161/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:12,135 MessagingService.java
(line 221) Resetting pool for /172.15.2.161 <http://172.15.2.161/> 
 INFO [GossipTasks:1] 2012-05-14 05:44:12,135 Gossiper.java (line 818)
InetAddress /172.15.2.162 <http://172.15.2.162/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:12,135 MessagingService.java
(line 221) Resetting pool for /172.15.2.162 <http://172.15.2.162/> 
 INFO [GossipTasks:1] 2012-05-14 05:44:12,136 Gossiper.java (line 818)
InetAddress /172.15.2.163 <http://172.15.2.163/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:12,136 MessagingService.java
(line 221) Resetting pool for /172.15.2.163 <http://172.15.2.163/> 
DEBUG [Thread-3008] 2012-05-14 05:44:12,625 Gossiper.java (line 222)
Reseting version for /172.15.2.163 <http://172.15.2.163/> 
DEBUG [Thread-3007] 2012-05-14 05:44:12,626 Gossiper.java (line 222)
Reseting version for /172.15.2.161 <http://172.15.2.161/> 
DEBUG [Thread-3009] 2012-05-14 05:44:12,682 Gossiper.java (line 222)
Reseting version for /172.15.2.162 <http://172.15.2.162/> 
DEBUG [WRITE-/172.15.2.163 <http://172.15.2.163/> ] 2012-05-14
05:44:13,136 OutboundTcpConnection.java (line 211) attempting to connect
to
/172.15.2.163 <http://172.15.2.163/>  DEBUG [WRITE-/172.15.2.161
<http://172.15.2.161/> ] 2012-05-14 05:44:13,136
OutboundTcpConnection.java (line 211) attempting to connect to
/172.15.2.161 <http://172.15.2.161/>   INFO [GossipTasks:1] 2012-05-14
05:44:14,136 Gossiper.java (line 818) InetAddress /172.15.2.164
<http://172.15.2.164/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:14,137 MessagingService.java
(line 221) Resetting pool for /172.15.2.164 <http://172.15.2.164/> 
DEBUG [WRITE-/172.15.2.164 <http://172.15.2.164/> ] 2012-05-14
05:44:15,137 OutboundTcpConnection.java (line 211) attempting to connect
to
/172.15.2.164 <http://172.15.2.164/>  DEBUG [WRITE-/172.15.2.162
<http://172.15.2.162/> ] 2012-05-14 05:44:17,138
OutboundTcpConnection.java (line 211) attempting to connect to

--------
NODE 1 LOG:
-------
DEBUG [MutationStage:7771] 2012-05-14 05:44:03,068 Table.java (line 387)
applying mutation of row
32303132303231305f677569645f66697273745f31303634323231 DEBUG
[Thrift:317] 2012-05-14 05:44:03,068 StorageProxy.java (line
212) Write timeout java.util.concurrent.TimeoutException for one (or
more) of: DEBUG [Thrift:317] 2012-05-14 05:44:03,068
CassandraServer.java (line 648) ... timed out
DEBUG [Thrift:317] 2012-05-14 05:44:03,069 CustomTThreadPoolServer.java
(line 260) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException: Cannot read. Remote
side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is
often indicative of an internal error on the server side. Please check
your server logs.)
       at
org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
       at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:
378)
       at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:
297)
       at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProto
col.java:204)
       at
org.apache.cassandra.thrift.Dse$Processor.process(Dse.java:908)
       at
com.datastax.bdp.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.
run(CustomTThreadPoolServer.java:250)
       at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecuto
r.java:886)
       at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja
va:908)
       at java.lang.Thread.run(Thread.java:662)
DEBUG [Thrift:317] 2012-05-14 05:44:03,070 ClientState.java (line 104)
logged out: #<User allow_all groups=[]>
DEBUG [RequestResponseStage:7161] 2012-05-14 05:44:03,073
ResponseVerbHandler.java (line 44) Processing response on a callback
from 3318177@/172.15.2.164 <http://172.15.2.164/>  DEBUG
[MutationStage:7772] 2012-05-14 05:44:03,079
RowMutationVerbHandler.java (line 44) Applying
RowMutation(keyspace='Performance90',
key='32303132303231305f677569645f66697273745f3130363 4323230',
modifications=[ColumnFamily(VideoMedia20120210
[file:false:31@1336999691332000,media:false:1048576@1336999691332001,])]
) 
DEBUG [MutationStage:7772] 2012-05-14 05:44:03,079 Table.java (line 387)
applying mutation of row
32303132303231305f677569645f66697273745f31303634323230
DEBUG [MutationStage:7772] 2012-05-14 05:44:03,081
RowMutationVerbHandler.java (line 56)
RowMutation(keyspace='Performance90',key='32303132303231305f677569645f66
697273745f31303634323230',
modifications=[ColumnFamily(VideoMedia20120210[file:false:31@13369996913
32000,media:false:1048576@1336999691332001,])])
applied.  Sending response to 3167090@/172.15.2.162
<http://172.15.2.162/> 
DEBUG [MutationStage:7773] 2012-05-14 05:44:03,084
RowMutationVerbHandler.java (line 44) Applying
RowMutation(keyspace='Performance90',
key='32303132303231305f677569645f66697273745f3130363 4323136',
modifications=[ColumnFamily(VideoMedia20120210
[file:false:31@1336999691335000,media:false:1048576@1336999691335001,])]
)
DEBUG [MutationStage:7773] 2012-05-14 05:44:03,084 Table.java (line 387)
applying mutation of row
32303132303231305f677569645f66697273745f31303634323136
DEBUG [MutationStage:7773] 2012-05-14 05:44:03,085
RowMutationVerbHandler.java (line 56)
RowMutation(keyspace='Performance90',
key='32303132303231305f677569645f66697273745f31303634323136',modificatio
ns=[ColumnFamily(VideoMedia20120210
[file:false:31@1336999691335000,media:false:1048576@1336999691335001,])]
)
applied.  Sending response to 3074618@/172.15.2.164
<http://172.15.2.164/> 
DEBUG [MutationStage:7774] 2012-05-14 05:44:07,026 StorageProxy.java
(line 357) Adding hint for /172.15.2.165 <http://172.15.2.165/> 
DEBUG [MutationStage:7774] 2012-05-14 05:44:07,026 Table.java (line 387)
applying mutation of row 66666666666666666666666666666664 

DEBUG [MutationStage:7774] 2012-05-14 05:44:07,026 SlabAllocator.java
(line 105) 1 regions now allocated in
org.apache.cassandra.utils.SlabAllocator@7dce40cc
DEBUG [MutationStage:7775] 2012-05-14 05:44:07,027 StorageProxy.java
(line 357) Adding hint for /172.15.2.165 <http://172.15.2.165/> 
DEBUG [MutationStage:7775] 2012-05-14 05:44:07,027 Table.java (line 387)
applying mutation of row 66666666666666666666666666666664 

DEBUG [Thread-1959] 2012-05-14 05:44:12,135 Gossiper.java (line 222)
Reseting version for /172.15.2.165 <http://172.15.2.165/> 
DEBUG [Thread-1960] 2012-05-14 05:44:12,135 Gossiper.java (line 222)
Reseting version for /172.15.2.165 <http://172.15.2.165/> 
 INFO [GossipTasks:1] 2012-05-14 05:44:12,626 Gossiper.java (line 818)
InetAddress /172.15.2.165 <http://172.15.2.165/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:12,626 MessagingService.java
(line 221) Resetting pool for /172.15.2.165 <http://172.15.2.165/> 

 

Pavel Polushkin.


Mime
View raw message