incubator-cassandra-user mailing list archives

From Olivier Rosello <orose...@corp.free.fr>
Subject Re: High CPU usage on all nodes without any read or write
Date Thu, 08 Jul 2010 13:44:54 GMT
Hi,

Thank you for your help.

I don't know whether data is being written to the cluster too fast, but I don't think so (the nodes are beefy: fast CPUs, 12 GB RAM...) and there isn't that much data (2,000 inserts/sec, for about 300 KB/sec of raw data).
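
As a quick sanity check on that rate, here is a small Python sketch (just arithmetic on the figures above, nothing measured on the cluster itself):

# Back-of-the-envelope check of the average insert size quoted above.
inserts_per_sec = 2000           # inserts/sec, from the figures above
raw_bytes_per_sec = 300 * 1024   # ~300 KB/sec of raw data

print(f"average payload per insert: ~{raw_bytes_per_sec / inserts_per_sec:.0f} bytes")
# -> roughly 150 bytes per insert, so the raw data volume really is small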


I wiped all the data yesterday at 6pm (GMT+2) and started everything again.

Everything had been fine until now: one node (out of 4 in total) has started timing out on writes (see cfstats and tpstats below).

Its CPU usage is between 100% and 250% (writes to the cluster are still going on).


root@cassandra-2:~# iostat -x -k 5
Linux 2.6.31-22-server (cassandra-2) 	07/08/2010 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.35    0.00    0.60    0.02    0.00   96.03

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.92    96.85    0.34    2.35    44.31   396.77   328.83     0.08   28.18   1.09   0.29

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.95    0.00    1.27    0.05    0.00   80.73

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    76.00    0.60   10.60     2.40   346.40    62.29     0.00    0.36   0.36   0.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.40    0.00    0.15    0.00    0.00   82.45

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    23.00    0.00    1.20     0.00    96.80   161.33     0.00    0.00   0.00   0.00



But in the Cassandra output log:
root@cassandra-2:~#  tail -f /var/log/cassandra/output.log 
 INFO 15:32:05,390 GC for ConcurrentMarkSweep: 1359 ms, 4295787600 reclaimed leaving 1684169392 used; max is 6563430400
 INFO 15:32:09,875 GC for ConcurrentMarkSweep: 1363 ms, 4296991416 reclaimed leaving 1684201560 used; max is 6563430400
 INFO 15:32:14,370 GC for ConcurrentMarkSweep: 1341 ms, 4295467880 reclaimed leaving 1684879440 used; max is 6563430400
 INFO 15:32:18,906 GC for ConcurrentMarkSweep: 1343 ms, 4296386408 reclaimed leaving 1685489208 used; max is 6563430400
 INFO 15:32:23,564 GC for ConcurrentMarkSweep: 1511 ms, 4296407088 reclaimed leaving 1685488744 used; max is 6563430400
 INFO 15:32:28,068 GC for ConcurrentMarkSweep: 1347 ms, 4295383216 reclaimed leaving 1686469448 used; max is 6563430400
 INFO 15:32:32,617 GC for ConcurrentMarkSweep: 1376 ms, 4295689192 reclaimed leaving 1687908304 used; max is 6563430400
 INFO 15:32:37,283 GC for ConcurrentMarkSweep: 1468 ms, 4296056176 reclaimed leaving 1687916880 used; max is 6563430400
 INFO 15:32:41,811 GC for ConcurrentMarkSweep: 1358 ms, 4296412232 reclaimed leaving 1688437064 used; max is 6563430400
 INFO 15:32:46,436 GC for ConcurrentMarkSweep: 1368 ms, 4296105472 reclaimed leaving 1691050032 used; max is 6563430400
 INFO 15:32:51,180 GC for ConcurrentMarkSweep: 1545 ms, 4297439832 reclaimed leaving 1691033816 used; max is 6563430400
 INFO 15:32:55,703 GC for ConcurrentMarkSweep: 1379 ms, 4295491928 reclaimed leaving 1692891456 used; max is 6563430400
 INFO 15:33:00,328 GC for ConcurrentMarkSweep: 1378 ms, 4296657208 reclaimed leaving 1694981528 used; max is 6563430400

(this doesn't appear on the other nodes, which are currently OK)
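
For what it's worth, here is a rough Python sketch to quantify those GC lines (it only assumes the log format shown above; I haven't used it for anything beyond this back-of-the-envelope check):

import re
import sys
from datetime import datetime

# Matches lines like:
#  INFO 15:32:05,390 GC for ConcurrentMarkSweep: 1359 ms, 4295787600 reclaimed leaving 1684169392 used; max is 6563430400
PATTERN = re.compile(
    r'INFO (\d{2}:\d{2}:\d{2}),\d+ GC for ConcurrentMarkSweep: '
    r'(\d+) ms, \d+ reclaimed leaving (\d+) used; max is (\d+)')

events = []
for line in sys.stdin:
    m = PATTERN.search(line)
    if m:
        ts = datetime.strptime(m.group(1), '%H:%M:%S')
        events.append((ts, int(m.group(2)), int(m.group(3)), int(m.group(4))))

if len(events) > 1:
    wall_ms = (events[-1][0] - events[0][0]).total_seconds() * 1000
    gc_ms = sum(duration for _, duration, _, _ in events)
    used, heap_max = events[-1][2], events[-1][3]
    print(f"{len(events)} CMS collections over {wall_ms / 1000:.1f} s, "
          f"~{100 * gc_ms / wall_ms:.0f}% of wall time in GC")
    print(f"heap after last collection: {used / heap_max:.0%} of {heap_max} bytes")

On the lines above this works out to roughly a third of the wall-clock time spent in CMS, with the heap only about a quarter full after each collection.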


Could this value be linked to the problem?
Compacted row maximum size: 1202492950

I suppose that, uncompacted, the row may be bigger than 2^31 bytes, as described here:
http://wiki.apache.org/cassandra/CassandraLimitations?highlight=(related%20limitation)
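
Putting that number next to the limit (plain arithmetic as a small sketch; the row size comes from the RAW column family stats below):

max_compacted_row = 1202492950   # bytes, "Compacted row maximum size" for the RAW CF below
row_limit = 2 ** 31              # the per-row limitation mentioned on the wiki page

print(f"largest compacted row: {max_compacted_row / 2**20:.0f} MiB, "
      f"{max_compacted_row / row_limit:.0%} of the 2 GiB limit")
# -> ~1147 MiB, i.e. about 56% of the limit, so uncompacted fragments of the
#    same row spread over several SSTables might well go past it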

Keyspace: system
	Read Count: 184
	Read Latency: 156.3704456521739 ms.
	Write Count: 591571
	Write Latency: 0.8834233777517829 ms.
	Pending Tasks: 0
		Column Family: HintsColumnFamily
		SSTable count: 13
		Space used (live): 4974145
		Space used (total): 4974145
		Memtable Columns Count: 30430
		Memtable Data Size: 289085
		Memtable Switch Count: 55
		Read Count: 181
		Read Latency: 158,899 ms.
		Write Count: 591564
		Write Latency: 0,883 ms.
		Pending Tasks: 0
		Key cache capacity: 16
		Key cache size: 2
		Key cache hit rate: 0.75
		Row cache: disabled
		Compacted row minimum size: 1630
		Compacted row maximum size: 336507
		Compacted row mean size: 155823

		Column Family: LocationInfo
		SSTable count: 2
		Space used (live): 1225
		Space used (total): 1225
		Memtable Columns Count: 0
		Memtable Data Size: 0
		Memtable Switch Count: 2
		Read Count: 3
		Read Latency: 3,844 ms.
		Write Count: 7
		Write Latency: 0,222 ms.
		Pending Tasks: 0
		Key cache capacity: 2
		Key cache size: 0
		Key cache hit rate: NaN
		Row cache: disabled
		Compacted row minimum size: 0
		Compacted row maximum size: 0
		Compacted row mean size: 0

----------------
Keyspace: MyKeyspace
	Read Count: 15755
	Read Latency: 44.23593043478261 ms.
	Write Count: 41379835
	Write Latency: 0.07479275296771966 ms.
	Pending Tasks: 0
		Column Family: Index1
		SSTable count: 14
		Space used (live): 813507746
		Space used (total): 813507746
		Memtable Columns Count: 29656
		Memtable Data Size: 919336
		Memtable Switch Count: 68
		Read Count: 3151
		Read Latency: 1,226 ms.
		Write Count: 5362886
		Write Latency: 0,039 ms.
		Pending Tasks: 0
		Key cache capacity: 200000
		Key cache size: 2488
		Key cache hit rate: 0.0016019223067681217
		Row cache: disabled
		Compacted row minimum size: 241
		Compacted row maximum size: 66680
		Compacted row mean size: 283

		Column Family: Index2
		SSTable count: 10
		Space used (live): 694093939
		Space used (total): 694093939
		Memtable Columns Count: 10916
		Memtable Data Size: 338396
		Memtable Switch Count: 42
		Read Count: 3151
		Read Latency: 5,919 ms.
		Write Count: 3442595
		Write Latency: 0,033 ms.
		Pending Tasks: 0
		Key cache capacity: 200000
		Key cache size: 2813
		Key cache hit rate: 0.01915041782729805
		Row cache: disabled
		Compacted row minimum size: 238
		Compacted row maximum size: 28563
		Compacted row mean size: 264

		Column Family: RAW
		SSTable count: 18
		Space used (live): 8621580313
		Space used (total): 8621580313
		Memtable Columns Count: 67751
		Memtable Data Size: 23322692
		Memtable Switch Count: 322
		Read Count: 3151
		Read Latency: 141,666 ms.
		Write Count: 23739615
		Write Latency: 0,097 ms.
		Pending Tasks: 0
		Key cache capacity: 200000
		Key cache size: 20
		Key cache hit rate: 0.5
		Row cache: disabled
		Compacted row minimum size: 558
		Compacted row maximum size: 1202492950
		Compacted row mean size: 84949879

		Column Family: Index3
		SSTable count: 9
		Space used (live): 715605405
		Space used (total): 715605405
		Memtable Columns Count: 90842
		Memtable Data Size: 2816102
		Memtable Switch Count: 49
		Read Count: 3151
		Read Latency: 69,815 ms.
		Write Count: 3710520
		Write Latency: 0,068 ms.
		Pending Tasks: 0
		Key cache capacity: 200000
		Key cache size: 2369
		Key cache hit rate: 0.09592691282832128
		Row cache: disabled
		Compacted row minimum size: 237
		Compacted row maximum size: 928234
		Compacted row mean size: 269

		Column Family: Index4
		SSTable count: 11
		Space used (live): 1071327924
		Space used (total): 1071327924
		Memtable Columns Count: 91479
		Memtable Data Size: 2835849
		Memtable Switch Count: 59
		Read Count: 3151
		Read Latency: 2,553 ms.
		Write Count: 5124219
		Write Latency: 0,044 ms.
		Pending Tasks: 0
		Key cache capacity: 200000
		Key cache size: 3437
		Key cache hit rate: 0.0534392984379282
		Row cache: disabled
		Compacted row minimum size: 238
		Compacted row maximum size: 861800
		Compacted row mean size: 264
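
The SSTable counts above come from cfstats; to cross-check directly on disk I could count the data files per column family with something like this (the /var/lib/cassandra/data path and the <CF>-<n>-Data.db naming are the 0.6 defaults, adjust if the node is configured differently):

import collections
import glob
import os

# Count SSTable data files per (keyspace, column family) under the data directory.
DATA_DIR = "/var/lib/cassandra/data"   # default DataFileDirectory, adjust if needed

counts = collections.Counter()
for path in glob.glob(os.path.join(DATA_DIR, "*", "*-Data.db")):
    keyspace = os.path.basename(os.path.dirname(path))
    cf = os.path.basename(path).split("-")[0]   # assumes CF names contain no dash
    counts[(keyspace, cf)] += 1

for (keyspace, cf), n in sorted(counts.items()):
    print(f"{keyspace}/{cf}: {n} data files")

A count much larger (and growing) compared with what cfstats reports above would point to compaction falling behind.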



Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            108
STREAM-STAGE                      0         0              0
RESPONSE-STAGE                    0         0       26411958
ROW-READ-STAGE                    0         0              0
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         0         0       56957780
GMFD                              0         0         246541
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE                0         0       41288534
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0            465
FLUSH-WRITER-POOL                 0         0            465
AE-SERVICE-STAGE                  0         0              0
HINTED-HANDOFF-POOL               1        43             59
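
(The only pool with anything queued is HINTED-HANDOFF-POOL.) A tiny sketch to pull the non-idle pools out of that output, assuming the column layout shown above:

import sys

# Read `nodetool tpstats` output on stdin and print pools with queued work.
for line in sys.stdin:
    parts = line.split()
    # data rows look like: POOL-NAME  Active  Pending  Completed
    if len(parts) == 4 and parts[1].isdigit():
        name, active, pending = parts[0], int(parts[1]), int(parts[2])
        if active or pending:
            print(f"{name}: active={active} pending={pending}")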



Thank you for your help,


Olivier


----- "Peter Schuller" <peter.schuller@infidyne.com> a écrit :

> De: "Peter Schuller" <peter.schuller@infidyne.com>
> À: user@cassandra.apache.org
> Envoyé: Mercredi 7 Juillet 2010 18:17:19
> Objet: Re: High CPU usage on all nodes without any read or write
>
> > It ran correctly for several days. Last night, we started to have
> > timeout exceptions on insert and high CPU load on all nodes.
> >
> > We stopped the inserts, but the CPU remains high (without any
> > inserts or reads).
> 
> Has data been written to the cluster faster than background compaction
> is proceeding? If so you may see Cassandra eating CPU (and doing I/O)
> in the background for extended periods of time even after you stop
> sending requests to it.
> 
> If this is what is happening it should be visible in the log that it's
> doing compaction, and you should see that the data directories contain
> lots of files (unless it's just now catching up) rather than the fairly
> few expected when compaction is up to speed.
> 
> Also consider that even if you're not writing faster than it can
> handle, if you have lots of data in total, the bigger compactions will
> take a considerable amount of time, so you may see CPU + disk activity
> for long periods even if all is otherwise well.
> 
> Of course, you say you're seeing timeouts. Is it possible these are
> timeouts that happen during compaction in general? What kind of
> latency are we talking about (a few extra hundred millis or several
> seconds?), and is there a correlation between the timeouts and lots of
> data being flushed to disk (iostat -x -k 1)?
> 
> -- 
> / Peter Schuller
